This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
These notebooks are designed to create a pleasing viewing environment for data analysis that allows you to include figures, text, links, etc., so that your work is better understood and can be reproduced and used with confidence.
The source code for this R notebook (Rmd suffixed files), when stored as web pages (html files), can be downloaded by clicking the button at the top of the page.
If viewing the source code in RStudio, try executing each R “chunk” by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
Warning. Typos are Legion!
1. Introduction
When you’re in MATH 381 (Intro to Probability and Stats) you’ll get a taste of R. R is an open-source statistical package built off of an earlier generation of commercial software.
The goal here is to demonstrate cracking open an Excel spreadsheet in R, calculating some basic stats, creating various plots to view the statistics, and finally, doing some linear and multivariate regression.
Another goal here is to show off some of R’s features. R is a very powerful tool. When translating “powerful” from computerese to any frustrated human dialect, that means “steep learning curve.” It’s also a community-supported environment. When translating “community-supported” from computerese to any overscheduled human dialect, that means “there are LOTS of people donating packages and libraries to R.” Some have evolved to be a standard in the community. Others are highly specialized for a given discipline (but have one or two items that people outside their user communities find handy).
But don’t let that intimidate you. Once you learn one language you can slowly pick up more. Also, with this demo we aren’t going to get you to be an R guru in a day.
If you want a good stepping off point to learn R I’d recommend some of the resources at Data Camp which have some free starter tutorials for R.
2. Loading the Libraries
To work with R we will first have to load some libraries. This is like in C where you have the #include statement to do things like raise things to powers and stuff like that.
Some of these libraries or “packages” come with R. Others will have to be installed. Here are the ones we are using for this exercise.
Also in this exercise, we’re going to use the tidyverse set of packages. Tidyverse is a set of co-developed tools for data science in R. This is the new big thing in R and is widely used so we are just going to jump in here. SD Mines has a course beyond Engineering Stats, MATH 443/543 (Data Analysis) that leverages this set of packages.
- Install Us First
- tidyverse : Set of commonly-used Data Science packages for R that it can install and load all at once. In the long-run you probably also want to install the tidyverse package suite anyway. For this exercise this will include…
- ggplot2 : Create Elegant Data Visualizations Using the Grammar of Graphics
- tibble : Simple Data Frames
- tidyr : Tools for shepherding data in data frames.
- readr : Read Rectangular Text Data
- purrr : Functional Programming Tools
- dplyr : A grammar of data manipulation
- stringr : Simple, Consistent Wrappers for Common String Operations
- forcats : Tools for Working with Categorical Variables (Factors)
- readxl : also part of the tidyverse package suite for reading traditional excel spreadsheets.
- moderndive : Tidyverse-Friendly Introductory Linear Regression
- This should come with R’s core install, if not install ’em.
- MASS : Has a lot of resources for regression.
- This doesn’t come with R’s core install so install that one…
- moments : This has a load of good stuff for data analysis and plotting, more than you will need here, but get it anyway.
- This is a nice contributed library that lets us make pretty statistics tables. It was written for ecological applications but it’s still pretty handy for looking at concrete
- pastecs: Package for Analysis of Space-Time Ecological Series
- Another nice contributed library that makes matrices of correlation coefficients look pretty (and graphically informative).
- corrplot Visualization of a Correlation Matrix
- While not officially needed for this activity, I’ll demonstrate how units can be used in R in this example
- udunits2 Provides simple bindings to Unidata’s udunits library for unit conversions (will be demonstrating but not explicitly needing it here)
- units Provides Measurement Units for R Vectors
# Tidyverse Handling Libraries
library(package = "tidyverse") # main tidyverse suite
library(package = "readxl") # Read Excel Files
library(package = "moderndive") # regression support
# Statistics Libraries
library(package = "moments") # Moments, cumulants, skewness, kurtosis and related tests
library(package = "MASS") # Support Functions and Datasets for Venables & Ripley's MASS text
# Extra Graphics Libraries
library(package = "corrplot") # Visualization of a Correlation Matrix
# Data Processing Libraries
library(package = "pastecs") # Package for Analysis of Space-Time Ecological Series
library(package = "udunits2") # Unit Conversion Support
# library(package = "units") # Measurement Units for R Vectors
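If one of the library() calls above stops with a "there is no package called ..." error, that package still has to be installed. Here is a minimal sketch (not part of the original notebook) that checks which of the packages named in this section are missing. The variable names needed, is_installed, and missing_packages are just illustrative, and the install line is left commented out so nothing gets downloaded by accident.

```r
# Packages named in this section (the list is copied from the text above)
needed <- c("tidyverse", "readxl", "moderndive", "moments", "MASS",
            "corrplot", "pastecs", "udunits2", "units")

# requireNamespace() returns TRUE/FALSE instead of stopping with an error
is_installed <- vapply(X         = needed,
                       FUN       = requireNamespace,
                       FUN.VALUE = logical(1),
                       quietly   = TRUE)

# Keep only the names that could not be found
missing_packages <- needed[!is_installed]
print(missing_packages)

# Uncomment to install whatever is missing (needs an internet connection)
# install.packages(missing_packages)
```

An empty result (character(0)) means everything is already in place.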
3. Cracking a Spreadsheet
The spreadsheet example below is more complicated than what you hopefully have.
The original data set is from a set of papers on Concrete by I-Cheng Yeh
Yeh, I-Cheng, “Modeling slump of concrete with fly ash and superplasticizer,” Computers and Concrete, 5(6), 559-572, 2008. doi: 10.12989/cac.2008.5.6.559.
Yeh, I-Cheng, “Simulation of concrete slump using neural networks,” Construction Materials, 162(1), 11-18, 2009. doi: 10.1680/coma.2009.162.1.11
Yeh, I-Cheng, “Prediction of workability of concrete using design of experiments for mixtures,” Computers and Concrete, 5(1), 1-20, 2008. doi: 10.12989/cac.2008.5.1.001
Yeh, I-Cheng, “Modeling slump flow of concrete using second-order regressions and artificial neural networks,” Cement and Concrete Composites, 29(6), 474-480, 2007. doi: 10.1016/j.cemconcomp.2007.02.001
Yeh, I-Cheng, “Exploring concrete slump model using artificial neural networks,” ASCE J. of Computing in Civil Engineering, 20(3), 217-221, 2006. doi: 10.1061/(ASCE)0887-3801(2006)20:3(217)
and is kept at the UC-Irvine Machine Learning Repository.
It can be found here at http://kyrill.ias.sdsmt.edu/cee_284/Base_Concrete_Slump_Test_for_R.xlsx
The relevant page and screenshot is below. For drama-free R import you are probably best off keeping a page on your spreadsheet file that is very simple, with numbers going down, and a single line for Row-1 with the headers of each column. If you want to get fancy on other pages that you’d turn in as tables in reports, you can do that on another spreadsheet page.
To crack open the spreadsheet we will want to use the read_excel function.
You can read the spreadsheet from a local drive or from a website.
# you will need the full path to the file you are using (either online or locally on your disk)
# The if else block should query your machine to determine which operating system.
# if you are not bi-platform, you likely don't need this.
if(.Platform$OS.type == "windows") {
# Windows
spreadsheet_name = file.path(Sys.getenv("USERPROFILE"), "Downloads", "Base_Concrete_Slump_Test_for_R.xlsx") # R does not expand %HOMEPATH% on its own
} else {
# Unix (Linux, MacOS, Solaris)
spreadsheet_name = "~/Downloads/Base_Concrete_Slump_Test_for_R.xlsx"
}
# I am keeping a copy of this spreadsheet at the URL below. It can be downloaded automatically
# and then loaded. We can also discreetly delete it when done.
spreadsheet_url = "http://kyrill.ias.sdsmt.edu/wjc/eduresources/Base_Concrete_Slump_Test_for_R.xlsx"
download.file(url = spreadsheet_url, # URL location
destfile = spreadsheet_name, # local downloaded location
mode = "wb") # binary mode so Windows does not mangle the xlsx file
trying URL 'http://kyrill.ias.sdsmt.edu/wjc/eduresources/Base_Concrete_Slump_Test_for_R.xlsx'
Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 18736 bytes (18 KB)
==================================================
downloaded 18 KB
remove(spreadsheet_url) # clean up variables
# this command will read the file
concrete = read_excel(path = spreadsheet_name, # local spreadsheet location
sheet = "Data", # page of spreadsheet
col_names = TRUE) # first row are the column headers
# clean up your hard drive! Don't be like me!
if(.Platform$OS.type == "windows") {
# Windows (DEL is a cmd.exe built-in that system() cannot run directly,
# so we let R delete the file itself)
file.remove(spreadsheet_name)
} else {
# Unix (Linux, MacOS, Solaris)
system(str_c("rm -v ",
spreadsheet_name,
sep=""))
}
/Users/wjc/Downloads/Base_Concrete_Slump_Test_for_R.xlsx
remove(spreadsheet_name) # clean up variables
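If juggling operating-system-specific paths feels like more trouble than it’s worth, base R’s tempfile() and unlink() are one way around it. The sketch below is a stand-in rather than the notebook’s method: it writes a tiny made-up table (not the Yeh data) to a temporary CSV file with a single header row, reads it back, and deletes the file, all without a network connection. For the real spreadsheet you would presumably hand a tempfile(fileext = ".xlsx") path to download.file() and read_excel() instead.

```r
# A tiny, made-up table laid out the "drama-free" way:
# one header row, numbers going down
demo_table <- data.frame(Test_Number = 1:3,
                         Cement      = c(250, 300, 350),  # invented values
                         Water       = c(180, 190, 200))  # invented values

# tempfile() builds a valid scratch path on any operating system
scratch_file <- tempfile(fileext = ".csv")

# Write the table out and read it straight back in
write.csv(x = demo_table, file = scratch_file, row.names = FALSE)
demo_readback <- read.csv(file = scratch_file)
print(demo_readback)

# unlink() deletes the file without calling the operating system's shell
unlink(scratch_file)
```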
With the data read in we can now look at the table of the data. This looks much nicer when working in R Notebooks instead of Plain Ordinary R.
# Print data frame
colnames(concrete)[1] = "Test_Number"
print(concrete)
4. Some Basic Statistics and Traditional Single Variable Plots
Let’s start with some basic statistics and plotting of them.
4.1. The “classic” stats
Let’s get the mom-and-apple-pie stats for Cement. The second argument in each call (na.rm = TRUE) lets you deal with missing data.
# statistics for cement
print(str_c(" Mean Cement : ",
mean(x = concrete$Cement, # variable to crunch
na.rm = TRUE) # ignore missing data
))
[1] " Mean Cement : 229.894174757282"
print(str_c(" Stdev Cement : ",
sd(x = concrete$Cement, # variable to crunch
na.rm = TRUE) # ignore missing data
))
[1] " Stdev Cement : 78.8772300268858"
print(str_c("Skewness Cement : ",
skewness(x = concrete$Cement, # variable to crunch
na.rm = TRUE) # ignore missing data
))
[1] "Skewness Cement : 0.143018080025135"
print(str_c("Kurtosis Cement : ",
kurtosis(x = concrete$Cement, # variable to crunch
na.rm = TRUE) # ignore missing data
))
[1] "Kurtosis Cement : 1.33448397363582"
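To take some of the mystery out of those last two numbers, here is a rough base-R sketch of the moment-based definitions, using a made-up vector rather than the concrete data. As far as I understand the moments package, its skewness() and kurtosis() divide by n (not n - 1) and report Pearson’s kurtosis, for which a normal distribution scores 3 rather than 0. Treat that as an assumption to check against the package documentation. The helper name central_moment is invented for this example.

```r
# Made-up sample (not the cement data)
x <- c(1, 2, 3, 4, 5)

# k-th central moment, dividing by n
central_moment <- function(values, k) {
  mean((values - mean(values))^k)
}

# Skewness: third central moment scaled by the variance^(3/2)
skewness_by_hand <- central_moment(x, 3) / central_moment(x, 2)^(3/2)

# Pearson kurtosis: fourth central moment scaled by the variance^2
kurtosis_by_hand <- central_moment(x, 4) / central_moment(x, 2)^2

print(skewness_by_hand)  # 0, because the sample is symmetric
print(kurtosis_by_hand)  # 1.7
```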
OK this is a little clunky. It would be nice if someone somewhere made a support library for R that will make nice tables of statistics.
In this case, Vive la France! A team from the French Research Institute for Exploitation of the Sea asked the same question and, as is often the case in the R community, not only drafted a set of tools to do this but also made it public.
Here we are using their stat.desc function from the pastecs package.
This will hopefully give people wanting to make basic tables “maximum satisfaction with minimal effort.”
# Plot a statistics table -- all the classics nice and handy and pretty.
options(digits=2) # this simply sets the number of significant digits printed in the table created below
# this particular function creates the table in scientific notation
concrete_statistics = stat.desc(x = concrete, # data frame
basic = TRUE, # includes counts and extremes
desc = TRUE, # include classic stats (mean etc)
norm = TRUE, # include normal dist stats (skewness etc)
p = 0.95) # use 95% confidence limits
print(concrete_statistics)
4.2. Reorganizing Your Data to Handle Multiple Variables at Once
To leverage some of R’s more nifty features we will need to reorganize our data from a “spreadsheet style” format into what some people call a “long form” table. The column headers of our concrete traits go into one single column, and the values from those columns all go into a second single column, similar to the graphic below.
This is done with the function gather()
# Gathering our components into a single column.
# We just want the names of our components here so we get everything past
# the first column (which is the experiment name)
column_names = colnames(concrete[2:ncol(concrete)])
as_tibble(column_names) # as_tibble (the replacement for the deprecated tbl_df) makes it look pretty when printed
# the gather command will stack everything named in column_names into two columns
concrete_tidy = gather(data = concrete, # your data frame
key = "Parameter", # column name for your former columns
value = "Value", # column name for your data
all_of(column_names) ) # the list for the columns to "gather"
# this will let us sort future plots in the same order as our original columns.
concrete_tidy$Parameter = factor(x = concrete_tidy$Parameter,
levels = column_names)
# we can also split things between our dependent variables and independent variables.
concrete_independent = subset(x = concrete_tidy,
subset = (Parameter != "Slump") &
(Parameter != "Flow") &
(Parameter != "Compressive_Strength_28dy")
)
concrete_dependent = subset(x = concrete_tidy,
subset = (Parameter == "Slump") |
(Parameter == "Flow") |
(Parameter == "Compressive_Strength_28dy")
)
print(concrete_tidy)
print(concrete_independent)
print(concrete_dependent)
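One thing worth knowing: the tidyr documentation now describes gather() as superseded by pivot_longer(). gather() still works, so nothing above has to change, but here is a hedged sketch of the same kind of reshaping with the newer function. It uses a small made-up data frame (toy_wide, with invented numbers) so it can run on its own. The argument names cols, names_to, and values_to are pivot_longer()’s own.

```r
library(package = "tidyr")  # part of the tidyverse

# A made-up "spreadsheet style" table (invented values, not the Yeh data)
toy_wide = data.frame(Test_Number = 1:2,
                      Cement      = c(250, 300),
                      Water       = c(180, 190))

# One row per (test, parameter) pair instead of one column per parameter
toy_long = pivot_longer(data      = toy_wide,
                        cols      = c("Cement", "Water"),  # columns to stack
                        names_to  = "Parameter",           # holds the old column names
                        values_to = "Value")               # holds the numbers

print(toy_long)
```

pivot_longer() orders its rows test by test rather than parameter by parameter, which should make no difference to plots that group by Parameter.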
5. Plotting Graphics using Tidyverse Resources
R has a number of spiffy ways to do the basic histograms, boxplots, and other distribution plots. We’re just using one here…
5.1. SLOOOOWWWWLLLLLYYY Making a Simple Plot (Histogram Edition)
Now I’m going to do this one tiny step at a time until we get to a viable product. (This is how I work through cryptic procedures so I can see what each little additional mystery thingie does.)
Graphing is invoked by the ggplot() command from the ggplot2 package… which has a heluvalot under its hood! For me all that detail was what had me a little shy to adopt this way of plotting data.
Tidyverse uses what is sometimes called the “grammar of graphics” method… to make a long story longer, the GoG presents separate commands to do separate things rather than bundling stuff in a single graphing function. Sometimes it makes a lot of sense… other times it may be confusing. (Hence me demonstrating making a graph this one tiny step at a time!)
First thing we are going to do is open a plotting space with the command ggplot()
# invoke the ggplot plotting environment.
ggplot()

Wow. We have a… big square of… grey. All it’s doing is setting up our plot environment… so let’s do some more…
If we want to do a histogram we are going to have to tell it what we want to plot and where to get the stuff.
When we add things to a plot command in Tidyverse we “add” the steps incrementally with a + sign.
This involves a “mapping” function called “aes” (short for aesthetics).
Here, we are working with the data frame “concrete” and the variable Cement, which we are tossing onto the x axis because that’s where the bins of cement go!
ggplot(data = concrete) + # EDIT: invoke graphics environment using a given dataframe
aes(x = Cement) # NEW: select variable to print... You can get really fancy here later

OK now we have something that looks like we may have the making of the graph. If you don’t like grey outlines and white grids, no worries, we can change that shortly.
OK.. we are now ready to make a histogram…
Here we will use one of ggplot2’s "geom_*" (draw stuff) resources. The default should work for us here.
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
aes(x = Cement) + # select variable to print... You can get really fancy here later
geom_histogram() # NEW: insert histogram

(You may have gotten a message that stat_bin() is using bins = 30. You can adjust this with the bins or binwidth argument of geom_histogram().)
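If you would rather pick the number of bins than accept the default, base R has a few rule-of-thumb helpers. A minimal sketch, assuming a stand-in sample of 103 made-up values (just the integers 1 to 103, not the real cement amounts). Sturges’ rule is only one of several conventions, so treat the result as a starting point.

```r
# Stand-in sample: 103 made-up values (not the cement data)
stand_in <- 1:103

# Sturges' rule: ceiling(log2(n) + 1) bins
n_bins <- nclass.Sturges(stand_in)
print(n_bins)  # 8

# The result could then be handed to ggplot2, e.g.
#   geom_histogram(bins = n_bins)
```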
Now quickly before moving on… I am not keen on the grey background with white lines.
There are a number of out-of-the-box “themes” for ggplot2.
I’m partial to theme_bw() and theme_light() but try the ones that you prefer or stick with the default, theme_gray().
These plots shown here are mine. You should fidget about so they are yours and so you can adapt to this new way of working with data.
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw() + # NEW: changing the plotting theme
aes(x = Cement) + # select variable to print... You can get really fancy here later
geom_histogram() # insert histogram (including controlling number of bins)

My OCD hates axes where the labels don’t envelop all of the data…
We can fix that with xlim() or ylim()
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw() + # changing the plotting theme
aes(x = Cement) + # select variable to print... You can get really fancy here later
xlim( 100, 400 ) + # NEW: adding x-axis limits
geom_histogram() # insert histogram

How about changing the color of the fill in the bars…
You really don’t want to know about all the colors you can use.
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw() + # changing the plotting theme
aes(x = Cement) + # select variable to print... You can get really fancy here later
xlim( 100, 400 ) + # NEW: adding x-axis limits
geom_histogram(fill="gray") # EDIT: insert histogram (with a single chosen color)

Want to customize the labels and titles so we can have units?
You can add custom labels and titles! (https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf)
For the superscripting in the x-axis label, I am using the expression() tool in R.
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw() + # changing the plotting theme
aes(x = Cement) + # select variable to print... You can get really fancy here later
xlim( 100, 400 ) + # adding x-axis limits
ggtitle("Yeh Superplasticizer Tests") + # NEW : Custom Title
xlab(expression('Cement Amount (kg m'^-3*")")) + # NEW : Custom Axis Label
geom_histogram(fill="gray") # insert histogram (with a single chosen color)

And I could keep tweaking this graph all day, but good enough is good enough so this is a good place to stop…
We also can plot a few other fields with some trial and error..
# Histogram of Water
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw() + # changing the plotting theme
aes(x = Water) + # select variable to print... You can get really fancy here later
xlim( 150, 250 ) + # adding x-axis limits
ggtitle("Yeh Superplasticizer Tests") + #Custom Title
xlab(expression('Water Amount (kg m'^-3*")")) + # NEW : Custom Axis Label note use of superscripts from above
geom_histogram(fill="blue") # insert histogram (with a single chosen color)

# Histogram of Strength
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw() + # changing the plotting theme
aes(x = Compressive_Strength_28dy) + # select variable to print... You can get really fancy here later
xlim( 10, 60 ) + # adding x-axis limits
ggtitle("Yeh Superplasticizer Tests") + #Custom Title
xlab("28-dy Compressive Strength (MPa)") + # NEW : Custom Axis Label
geom_histogram(fill="red") # insert histogram (with a single chosen color)

(And from our Intro to Stats Lecture…)
# Histogram of Slump
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw() + # changing the plotting theme
aes(x = Slump) + # select variable to print... You can get really fancy here later
xlim( 0, 30 ) + # adding x-axis limits
ggtitle("Yeh Superplasticizer Tests") + #Custom Title
xlab("Slump (cm)") + # NEW : Custom Axis Label
geom_histogram(fill="darkgreen") # insert histogram (with a single chosen color)

5.2. Distribution Plot [not so good an] Example
There are some other plots that we can use to describe our data.
Here to play with them we will take a quick step back and address that “tidy”’ed (should that say “tidied”?) dataframe “concrete_tidy”
We can now use all the parameters in the “tidy” (long) data frame to print by specific traits.
ggplot(data = concrete_tidy) + # invoke graphics environment using a given dataframe
theme_bw() + # changing the plotting theme
aes(x = Value, # map x-axis value
color = Parameter) + # map colors for different quality
ggtitle("Yeh Superplasticizer Tests") + # Custom Title
xlab("Value") + # Custom Axis Label
geom_density() # create a relative density plot

In the past, I’ve gotten good results with this but in this case, I think it’s too messy in part due to the disparity in the dynamic range of our parameters.
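One possible workaround, which is my suggestion rather than anything the original analysis does, is to standardize each parameter (subtract its mean and divide by its standard deviation) before drawing the densities, so every curve shares a common, unitless axis. A base-R sketch with invented numbers:

```r
# Two made-up parameters with very different ranges (not the Yeh data)
toy <- data.frame(Cement           = c(150, 200, 250, 300, 350),
                  Superplasticizer = c(5, 7, 9, 11, 13))

# scale() centers each column on 0 and rescales it to a standard deviation of 1
toy_standardized <- as.data.frame(scale(toy))
print(toy_standardized)

# Both columns now sit on the same unitless scale
print(sapply(toy_standardized, range))
```

The price is that the x-axis no longer reads in physical units, so this is better suited to comparing shapes than to reading off amounts.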
5.3. Box-Whisker Plot Example
How about leveraging a box whisker? (I’m using only the independent variables this time.)
ggplot(data = concrete_independent) + # EDIT Changing dataframe
theme_bw( ) + # changing the plotting theme
theme(axis.text.x = element_blank()) + # adding an extra trait to the x-axis
# to not print labels on the x-axis
# (the labels overlap and doesn't look
# pretty...)
aes(y = Value, # map y-axis value
x = Parameter, # map x-axis value
color = Parameter) + # map colors for different quality
ggtitle(label = "Yeh Superplasticizer Tests",
subtitle = "Concrete Test Components") + # Custom Title
ylab(expression('Amount (kg m'^-3*")")) + # EDIT : Changing Custom Axis Label
geom_boxplot() # insert a box-whisker plot

What about our dependent variables? We can start by changing the data frame…
ggplot(data = concrete_dependent) + # EDIT Changing dataframe
theme_bw( ) + # changing the plotting theme
theme(axis.text.x = element_blank()) + # adding an extra trait to the x-axis
# to not print labels on the x-axis
# (the labels overlap and doesn't look
# pretty...)
aes(y = Value, # map y-axis value
x = Parameter, # map x-axis value
color = Parameter) + # map colors for different quality
ggtitle(label = "Yeh Superplasticizer Tests",
subtitle = "Concrete Test Results") + # Custom Title
ylab("Values") +
geom_boxplot() # insert a box-whisker plot

Want units? That’s a little tougher here since the units differ by parameter. We can force the legend labels into new names, though.
ggplot(data = concrete_dependent) + # EDIT Changing dataframe
theme_bw( ) + # changing the plotting theme
theme(axis.text.x = element_blank()) + # adding an extra trait to the x-axis
# to not print labels on the x-axis
# (the labels overlap and doesn't look
# pretty...)
aes(y = Value, # map y-axis value
x = Parameter, # map x-axis value
color = Parameter) + # map colors for different quality
ggtitle(label = "Yeh Superplasticizer Tests",
subtitle = "Concrete Test Results") + # Custom Title
ylab("Values") +
# NEW: It says scale color but "color" is how we are distinguishing
# our boxplots (as seen in the mapping/aes command)
# we can then use the same plot order above to rewrite the labels
# (likewise we could change the plot order and of course the colors.)
scale_color_discrete(labels = c("Slump (cm)",
"Flow (cm)",
"28-dy Compressive Strength (MPa)")) +
geom_boxplot() # insert a box-whisker plot

5.4. Violin Plot Example
How about leveraging a “violin” plot? A violin plot’s width swells in areas with more observations and contracts with sparser data so it is like looking at a probability distribution.
ggplot(data = concrete_independent) + # EDIT Changing dataframe
theme_bw( ) + # changing the plotting theme
theme(axis.text.x = element_blank()) + # adding an extra trait to the x-axis
# to not print labels on the x-axis
# (the labels overlap and doesn't look
# pretty...)
aes(y = Value, # map y-axis value
x = Parameter, # map x-axis value
color = Parameter) + # map colors for different quality
ggtitle(label = "Yeh Superplasticizer Tests",
subtitle = "Concrete Test Components") + # Custom Title
ylab(expression('Amount (kg m'^-3*")")) + # Changing Custom Axis Label
geom_violin(scale="width") # EDIT: change to a violin plot
# the width argument
# gives every plot the same width
and…
ggplot(data = concrete_dependent) + # EDIT Changing dataframe
theme_bw( ) + # changing the plotting theme
theme(axis.text.x = element_blank()) + # adding an extra trait to the x-axis
# to not print labels on the x-axis
# (the labels overlap and doesn't look
# pretty...)
aes(y = Value, # map y-axis value
x = Parameter, # map x-axis value
color = Parameter) + # map colors for different quality
ggtitle(label = "Yeh Superplasticizer Tests",
subtitle = "Concrete Test Results") + # Custom Title
ylab("Values") +
# NEW: It says scale color but "color" is how we are distinguishing
# our violin plots (as seen in the mapping/aes command)
# we can then use the same plot order above to rewrite the labels
# (likewise we could change the plot order and of course the colors.)
scale_color_discrete(labels = c("Slump (cm)",
"Flow (cm)",
"28-dy Compressive Strength (MPa)")) +
geom_violin(scale="width") # EDIT: change to a violin plot
# the width argument
# gives every plot the same width
This is basically the above “density” plot but “looking down” as with a box plot. Also here we are trimming the plot so that when we leave the range of any of the data points, the “violins” are truncated.
5.5. Stacked Column or Bar Plot Example
We also can do bar plots or stacked column plots. The one produced here shows the combined components by test unit.
ggplot(data = concrete_independent) + # EDIT Changing dataframe
theme_bw( ) + # changing the plotting theme
aes(x = Test_Number,
y = Value,
fill = Parameter) + # map colors for different quality
ggtitle(label = "Yeh Superplasticizer Tests",
subtitle = "Concrete Test Components") + # Custom Title
ylab(expression('Amount (kg m'^-3*")")) + # Changing Custom Axis Label
geom_col(position = "stack", # new, create a stacked column graph
width = 1.0 ) # with no space between columns

6. Correlation of Variables
6.1. Correlating and then Fitting Cement to Compressive Strength
Let’s start by doing a “simple” plot. In this case I already know the answer because the spreadsheet also has a table of how well our independent variables correlate against the dependent variables (e.g., Slump, Flow, or in our case Strength). Cement correlates the best against Compressive Strength (OK, truth be told, it correlates the least badly).
We can actually do this with a correlate function, cor()…
To grab a value in the table “concrete” we call the data frame (concrete) and the variable name (Cement or Water vs Compressive_Strength_28dy), separating the frame and variable names by a $ sign.
print("Cement vs Compressive Strength Correlation, r")
[1] "Cement vs Compressive Strength Correlation, r"
cor(x = concrete$Cement, # the x-value
y = concrete$Compressive_Strength_28dy, # the y-value
method = "pearson" # method of correlation
)
[1] 0.45
or if you like to do everything at once…
# calculate all correlation values against each other
correlation_matrix = cor(x = concrete, # using our dataframe to correlate everything
method = "pearson" )
as_tibble(correlation_matrix) # as_tibble replaces the deprecated tbl_df
Lots of numbers… not all that insightful on their own…
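For the curious, the Pearson r that cor() reports is just the covariance of the two variables divided by the product of their standard deviations. A quick sketch to convince yourself, using R’s built-in cars data set as a stand-in so it runs without the spreadsheet:

```r
# Built-in stand-in data: stopping distance vs speed for 50 cars
x <- cars$speed
y <- cars$dist

# Pearson correlation straight from its definition
r_by_hand  <- cov(x, y) / (sd(x) * sd(y))

# ... and from the same function used on the concrete data
r_from_cor <- cor(x = x, y = y, method = "pearson")

print(c(by_hand = r_by_hand, from_cor = r_from_cor))
```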
You also can graph the look-n-feel of what all of the different correlations are… (it works best with a much smaller number of variables)
# draw a correlation graphic...
corrplot(corr = correlation_matrix,
type = "upper")

We can now see, for example, that cement, slag, and fly ash amounts have a nominal but not thrilling correlation to compressive strength, while water has a good correlation with the resulting slump values. One thing that this does not show is how well these parameters play with other parameters. As we’ll see when all of our independent values are working together, cement and water, followed by fly ash and coarse aggregates, will together contribute the most of our independent parameters in calculating the compressive strength.
6.2. Scatter Plot Example
But for now, let’s plot the Cement amount against Compressive Strength.
# Making a simple X-Y scatterplot.
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw( ) + # changing the plotting theme
aes(x = Cement, # x-value
y = Compressive_Strength_28dy) + # y-value
ggtitle("Yeh Superplasticizer Tests") + # Custom Title
xlab(expression('Cement Amount (kg m'^-3*")")) + # x-label
ylab("28-dy Compressive Strength (MPa)") + # y-label
geom_point(colour="grey") # EDIT: plot points (the "colour" keyword
# was written by an anglophile!)
Here’s a cute trick: Could we color those dots by a variable?
Sure!
# Making a simple X-Y scatterplot now coloured by another parameter
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw( ) + # changing the plotting theme
aes(x = Cement, # x-value
y = Compressive_Strength_28dy, # y-value
color = Superplasticizer) + # ADD: we can color by a variable!
ggtitle("Yeh Superplasticizer Tests") + # Custom Title
xlab(expression('Cement Amount (kg m'^-3*")")) + # x-label
ylab("28-dy Compressive Strength (MPa)") + # y-label
geom_point() + # plot points
scale_color_distiller(palette = "Spectral") # NEW: pick a custom "colour" palette.

Love overkill without any distinct numerical score, and want to look at how everything in your data set correlates with every other variable…?
Try pairs()
(I like the corrplot function better!)
# way too many tiny plots!
pairs(x = concrete, # do everything in the dataframe
pch = ".") # plot dots (the default is circles)

(Obviously the more variables in your dataframe the messier it gets!)
6.3. Creating our linear model and “calibrating” it
We weren’t all that thrilled with the correlation between these components and strength, but let’s go ahead and create a regression model from this anyway.
Here we will use the lm() (linear model) function, which comes with R’s built-in stats package.
For the regression formula
\(\widehat{y}(x) = {\alpha_0}+{\alpha_1}\ x\)
or
\(\widehat{Strength}(Cement) = {\alpha_0}+{\alpha_1}\ Cement\)
the “prototype” (formula) for the function is written as …
“Y ~ X” (with the y-intercept implicit in the formula… you don’t put it in but it’ll be there when you’re done.)
The above syntax works like this…
Dependent Variable [~ is a function of ] Independent Variable [and any other parameter you need gets added with a plus]
If this were \(\widehat{y}(x)={\alpha_0}+{\alpha_1}\ x^3\), then the prototype for the function would be y ~ I(x^3). (The I() wrapper is needed because ^ has a special meaning inside a formula; a bare y ~ x^3 is treated the same as y ~ x.)
This will hopefully make more sense as we continue!
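One wrinkle with power terms: inside a formula the caret is a formula operator rather than plain arithmetic, so a cubic term needs to be wrapped in I() to be taken literally. A small sketch with synthetic, noise-free data (invented for this example) where the true intercept is 2 and the true cubic coefficient is 0.5:

```r
# Synthetic data: y = 2 + 0.5 * x^3 exactly (no noise)
x <- 1:10
y <- 2 + 0.5 * x^3

# I() protects the arithmetic, so this fits y against x cubed
cubic_model <- lm(formula = y ~ I(x^3))
print(coef(cubic_model))

# Without I(), x^3 collapses to plain x and we get a straight-line fit in x
plain_model <- lm(formula = y ~ x^3)
print(coef(plain_model))
```

The first fit recovers the intercept and coefficient used to build the data; the second reports a coefficient named x, which is the giveaway that the power was dropped.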
(lm and similar linear regression functions don’t play well with units.)
linear_model.S_v_c = lm(formula = Compressive_Strength_28dy ~ Cement, # your formula y ~ x
data = concrete) # the data frame
Let’s see what we have… This summary command will provide the details of the lm() function’s important results
For us we want to see the Y-Intercept [the (Intercept) under “Estimate”] and the slope that goes with our independent value (“Cement” under “Estimate”).
The Standard Error of the Estimate is there (Residual Standard Error) as is the Coefficient of Determination (Multiple R-squared)
We’ll talk about a few of the other features when we do the larger multivariate regression
summary(object = linear_model.S_v_c)
Call:
lm(formula = Compressive_Strength_28dy ~ Cement, data = concrete)
Residuals:
Min 1Q Median 3Q Max
-15.134 -5.313 0.832 5.155 17.968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.85676 2.15022 12 < 2e-16 ***
Cement 0.04429 0.00885 5 2.4e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7 on 101 degrees of freedom
Multiple R-squared: 0.199, Adjusted R-squared: 0.191
F-statistic: 25 on 1 and 101 DF, p-value: 2.38e-06
In the above output, the asterisks identify the most significant independent variables. Here it’s trivial, even though this is a terrible relationship between cement and strength. Later we will use all of our available independent variables and these asterisks will become more important.
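You don’t have to read those numbers off the printout by eye. Here is a hedged sketch of pulling them out programmatically, with the built-in cars data standing in for the concrete table (the object names stand_in_model, stand_in_summary, and r_squared_check are made up for this example):

```r
# Stand-in model: stopping distance as a function of speed
stand_in_model   <- lm(formula = dist ~ speed, data = cars)
stand_in_summary <- summary(object = stand_in_model)

coef(stand_in_model)         # y-intercept and slope (the "Estimate" column)
stand_in_summary$sigma       # residual standard error
stand_in_summary$r.squared   # coefficient of determination

# For a one-variable fit, R-squared is the square of the Pearson r
r_squared_check <- cor(cars$speed, cars$dist)^2
print(c(from_summary = stand_in_summary$r.squared,
        from_cor     = r_squared_check))
```

The same check works on the cement fit above: 0.45 squared is about 0.20, in line with the Multiple R-squared of 0.199.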
Want to plot it?
Good news?
Like Excel, you have some automated features to give you quick satisfaction and happiness. More still, it will give you confidence limits.
For this we use a ggplot2 layer called geom_smooth()
# Making a simple X-Y scatterplot and adding a regression to it
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw( ) + # changing the plotting theme
aes(x = Cement, # x-value
y = Compressive_Strength_28dy) + # y-value
ggtitle("Yeh Superplasticizer Tests") + # Custom Title
xlab(expression('Cement Amount (kg m'^-3*")")) + # x-label
ylab("28-dy Compressive Strength (MPa)") + # y-label
geom_point(colour="darkgrey") + # plot points
geom_smooth(method = "lm", # use a simple linear model
formula = y ~ x, # lm-style formula
se = TRUE, # display Confidence Intervals
level = 0.95, # Confidence Level to Map Out
colour = "black", # regression line color
size = 0.5) # line thickness

The line here looks like a positive correlation between the cement amount and the resulting strength.
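The shaded band that geom_smooth() draws is the confidence interval for the fitted line. If you want the numbers behind a band like that, predict() can return them. This sketch again leans on the built-in cars data, and the new_speeds data frame is invented for the example.

```r
# Stand-in model: stopping distance as a function of speed
stand_in_model <- lm(formula = dist ~ speed, data = cars)

# Speeds at which we want the fitted line and its 95% confidence limits
new_speeds <- data.frame(speed = c(5, 15, 25))

fitted_band <- predict(object   = stand_in_model,
                       newdata  = new_speeds,
                       interval = "confidence",
                       level    = 0.95)

# columns: fit (the line), lwr and upr (the confidence limits)
print(fitted_band)
```

Swapping interval = "confidence" for interval = "prediction" gives the wider band for individual new observations rather than for the line itself.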
Let’s try water:
# getting the linear model
linear_model.S_v_w = lm(formula = Compressive_Strength_28dy ~ Water, # your formula y ~ x
data = concrete ) # the data frame
summary(linear_model.S_v_w)
Call:
lm(formula = Compressive_Strength_28dy ~ Water, data = concrete)
Residuals:
Min 1Q Median 3Q Max
-19.359 -5.451 -0.986 4.690 18.825
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.4824 7.3978 7.50 2.5e-11 ***
Water -0.0986 0.0373 -2.64 0.0096 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.6 on 101 degrees of freedom
Multiple R-squared: 0.0646, Adjusted R-squared: 0.0554
F-statistic: 6.98 on 1 and 101 DF, p-value: 0.00956
# Making a simple X-Y scatterplot and adding a regression to it
ggplot(data = concrete) + # invoke graphics environment using a given dataframe
theme_bw( ) + # changing the plotting theme
aes(x = Water, # x-value
y = Compressive_Strength_28dy) + # y-value
ggtitle("Yeh Superplasticizer Tests") + # Custom Title
xlab(expression('Water Amount (kg m'^-3*")")) + # x-label
ylab("28-dy Compressive Strength (MPa)") + # y-label
geom_point(colour="darkblue") + # plot points
geom_smooth(method = "lm", # use a simple linear model
formula = y ~ x, # lm-style formula
se = TRUE, # display Confidence Intervals
level = 0.95, # Confidence Level to Map Out
colour = "blue", # regression line color
fill = "cyan", # NEW: fill for confidence limits
size = 0.5) # line thickness

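Looking back at the tables above, none of the variables explains much of the strength on its own: cement by itself gives an R-squared of only about 0.20, and water about 0.06.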
7. Multivariate Linear Regression
And now we’re going to do something about that!
We’re now going to use not just one independent variable… but all 7 of them!
The good news is that it follows the same form as the simple linear regression. This time we string along all of our independent variables within our formula prototype.
Our formula now has multiple independent values but still follows the same style of solution…
\(\widehat{y}(\mathbf{x}) = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \dots + \alpha_n x_n\)
linear_model.S_v_all <- lm(data = concrete, # your data frame
formula = Compressive_Strength_28dy ~ Cement + # your formula
Slag +
Fly_Ash +
Water +
Superplasticizer +
Fine_Aggregates +
Coarse_Aggregates)
And here are these results…
summary(object = linear_model.S_v_all)
Call:
lm(formula = Compressive_Strength_28dy ~ Cement + Slag + Fly_Ash +
Water + Superplasticizer + Fine_Aggregates + Coarse_Aggregates,
data = concrete)
Residuals:
Min 1Q Median 3Q Max
-5.841 -1.706 -0.283 1.299 7.942
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 139.7815 71.1013 1.97 0.0522 .
Cement 0.0614 0.0228 2.69 0.0084 **
Slag -0.0297 0.0318 -0.94 0.3520
Fly_Ash 0.0505 0.0232 2.18 0.0316 *
Water -0.2327 0.0717 -3.25 0.0016 **
Superplasticizer 0.1031 0.1346 0.77 0.4453
Fine_Aggregates -0.0391 0.0288 -1.36 0.1783
Coarse_Aggregates -0.0556 0.0274 -2.03 0.0455 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.6 on 95 degrees of freedom
Multiple R-squared: 0.897, Adjusted R-squared: 0.889
F-statistic: 118 on 7 and 95 DF, p-value: <2e-16
Our regression coefficients are still here under the “Estimate” column, as are our Standard Error of the Estimate (Residual standard error) and our Coefficient of Determination (Multiple R-squared).
Also we can now take a good look at those asterisks at the end of each line of parameter coefficients. These show which independent variables are statistically significant in our regression. The more asterisks, the smaller the p-value, and the more confident we can be that the independent variable contributes to the larger multivariate regression. Here, we can see that Cement and Water (two asterisks each) are doing most of the “work” in fitting our suite of independent variables to our dependent variable of Compressive Strength.
Finally there is the p-value on the last line, which belongs to the F-statistic for the regression as a whole. The smaller it is, the less likely it is that a fit this good would turn up by chance if strength were unrelated to our independent variables.
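One quirk: that overall p-value is printed by summary() but it isn't stored in the summary object, so if you want it as a number you have to rebuild it from the F-statistic. A minimal sketch using the built-in mtcars data as a stand-in (an assumption to keep the chunk self-contained; demo_model, f_stat and overall_p are illustrative names):

```r
# Stand-in multivariate model from the built-in mtcars data
# (NOT the concrete data set)
demo_model <- lm(formula = mpg ~ wt + hp, data = mtcars)

# summary() stores the F-statistic as a named vector: value, numdf, dendf
f_stat <- summary(demo_model)$fstatistic
print(f_stat)

# The p-value is the upper tail of the F distribution
overall_p <- pf(q          = f_stat["value"],
                df1        = f_stat["numdf"],
                df2        = f_stat["dendf"],
                lower.tail = FALSE)

print(unname(overall_p))
```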
Now… on to looking at our results.
Here is where viewing the results of the regression is tricky.
We have 7 independent variables but we’d like to see the impact of the fit of all 7 variables on our strength.
When I do this I like to plot the true y value against my regression y(x1,x2,x3,..)
So to do this I will take the fitted values of y and plot them against the original values of y
Getting the fitted values is easy.
I’m using the get_regression_points() function from the moderndive package, which adds the modeled “y-hat” value (and the residual) to a dataframe alongside all of the other values.
The fitted version is the dependent variable w/ a "_hat" at the end.
fitted.S_v_all = get_regression_points(model = linear_model.S_v_all)
print(fitted.S_v_all)
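If you don't have moderndive handy, base R can get you to much the same place with fitted() and residuals(). This is a sketch of the idea rather than a copy of what get_regression_points() does internally; it uses the built-in mtcars data as a stand-in, and the column names in demo_fitted are my own.

```r
# Stand-in multivariate model from the built-in mtcars data
# (NOT the concrete data set)
demo_model <- lm(formula = mpg ~ wt + hp, data = mtcars)

# Build a small table of observed, modeled and residual values
demo_fitted <- data.frame(mpg      = mtcars$mpg,            # observed y
                          mpg_hat  = fitted(demo_model),    # modeled y-hat
                          residual = residuals(demo_model)) # observed - modeled

print(head(demo_fitted))

# With an intercept in the model the residuals average out to zero
# (to within floating-point rounding)
print(mean(demo_fitted$residual))
```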
And finally we can plot our actual vs modeled values. (I’m adding a trend line)
# Making a simple X-Y scatterplot and adding a regression to it
ggplot(data = fitted.S_v_all) + # invoke graphics environment using a given dataframe
theme_bw( ) + # changing the plotting theme
aes(x = Compressive_Strength_28dy, # x-value
y = Compressive_Strength_28dy_hat) + # y-value
ggtitle("Yeh Superplasticizer Tests",
subtitle = "28-dy Compressive Strength (MPa)") + # EDITED: Custom Title now with a subtitle
ylab("Modelled") + # y-label
xlab("Observed") + # x-label
geom_point(colour="darkred") + # plot points
geom_smooth(method = "lm", # use a simple linear model
formula = y ~ x, # lm-style formula
se = TRUE, # display Confidence Intervals
level = 0.95, # Confidence Level to Map Out
colour = "red", # regression line color
fill = "magenta", # fill for confidence limits
size = 0.5) + # line thickness
geom_abline(slope = 1, # NEW: add a very simple line
intercept = 0, # (for a 1:1 reference)
color = "grey",
linetype = "dashed") +
coord_fixed(ratio = 1) # NEW: fix the aspect ratio at 1:1
# (I like my plots square)

And here we have a nice plot showing our true vs predicted values.
8. Regression Quality Metrics
And to close things off, we can do some general error metrics that may be useful..
First, the Mean Error (ME) or Bias… (if we are too high or too low)
\(BIAS = ME = \frac{1}{N} \sum_{i=1}^{N} [\widehat{y}(\overrightarrow{x_i})-y_i] = \overline{[\widehat{y}(\overrightarrow{x_i})-y_i]}\)
# Calculate Bias (Mean Error)
bias = mean(fitted.S_v_all$Compressive_Strength_28dy_hat -
fitted.S_v_all$Compressive_Strength_28dy)
print(str_c(" Mean Error (ME) or Bias: ", bias))
[1] " Mean Error (ME) or Bias: 2.91262135922341e-05"
For a linear or multivariate regression the average of our residuals (the difference between each observation and prediction) should be zero. The tiny non-zero value printed here is most likely a rounding artifact, since get_regression_points() rounds the values it returns (to three decimal places by default, as far as I can tell).
The root mean squared error (RMSE) is shown here. It won’t be zero, since the residuals are squared before they are summed. Technically we should use the standard error of the estimate, but RMSE remains a common error metric, and we can always report both. The standard error of the estimate accounts for the degrees of freedom, which now depend on the number of independent variables (p). We can get the standard error of the estimate from the summary of our linear model.
\(RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} [\widehat{y}(\overrightarrow{x_i})-y_i]^2 } = \sqrt{\overline{[\widehat{y}(\overrightarrow{x_i})-y_i]^2} }\)
\(s_{e}\) or \(s_{y/x} = \sqrt{ \frac{1}{N-p-1} \sum_{i=1}^{N} [\widehat{y}(\overrightarrow{x_i})-y_i]^2 }\)
# Calculate RMSE
rmse = sqrt(mean( (fitted.S_v_all$Compressive_Strength_28dy_hat -
fitted.S_v_all$Compressive_Strength_28dy)^2) )
print(str_c(" Root Mean Squared Error (RMSE): ",
rmse))
[1] " Root Mean Squared Error (RMSE): 2.50527978593714"
print(str_c("Standard Error of the Estimate (se): ",
summary(linear_model.S_v_all)$sigma)) # you have to dig for this one!
[1] "Standard Error of the Estimate (se): 2.60865763395229"
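Since the two formulas differ only in what they divide by, you can hop from one to the other with a single scaling factor. Here is that arithmetic with the numbers printed above; N = 103 is my inference from the 95 residual degrees of freedom plus 7 independent variables plus the intercept, and se_from_rmse is just an illustrative name.

```r
rmse <- 2.50527978593714  # RMSE printed above
n    <- 103               # observations (95 residual df + 7 variables + 1 intercept)
p    <- 7                 # number of independent variables

# s_e = RMSE * sqrt( N / (N - p - 1) )
se_from_rmse <- rmse * sqrt(n / (n - p - 1))

print(se_from_rmse)
```

The result should land within rounding of the standard error that summary() reports.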
And finally our correlation coefficient (which is basically our coefficient of determination before the “R” is “squared”)
# Get The Unadjusted Correlation Coefficient
r = cor(x = fitted.S_v_all$Compressive_Strength_28dy, # the x-value
y = fitted.S_v_all$Compressive_Strength_28dy_hat, # the y-value
method = "pearson" # method of correlation
)
print(str_c(" correlation coefficient (r): ", r))
[1] " correlation coefficient (r): 0.94701611900088"
print(str_c(" coefficient of determination (r²): ", r^2,
" ",
summary(linear_model.S_v_all)$r.squared))
[1] " coefficient of determination (r²): 0.896839529647489 0.896837609814009"
print(str_c("adjusted coefficient of determination (Adjusted r²): ",
summary(linear_model.S_v_all)$adj.r.squared))
[1] "adjusted coefficient of determination (Adjusted r²): 0.889236170537147"
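The adjusted value isn't magic either. It penalizes the plain coefficient of determination for the number of independent variables, using the usual formula 1 - (1 - R²)(N - 1)/(N - p - 1). A quick check with the numbers printed above (N = 103 is again my inference from the degrees of freedom; adj_r_squared is an illustrative name):

```r
r_squared <- 0.896837609814009  # Multiple R-squared from summary() above
n         <- 103                # observations (95 residual df + 7 variables + 1 intercept)
p         <- 7                  # number of independent variables

# Adjusted R-squared penalizes for the number of independent variables
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(adj_r_squared)
```

This should reproduce the adjusted value that summary() reports.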
And with that, we’re done… Once again, this exercise demonstrates a lot of tricks just to show how you can use R for various statistics. You may not use all of them in your encounters with R for linear or multivariate regression or even at all, but you may be able to cannibalize some of the tricks here for other applications.
---
title: "Visualizing Statistics and Regressions from a Spreadsheet using R"
output:
  pdf_document: default
  html_notebook:
    toc: yes
  html_document:
    df_print: paged
    toc: yes
--- 

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

These notebooks are designed to create a pleasing viewing environment for data analysis that allows you to include figures, text, links, etc. so that your work is better understood and can be reproduced and used with confidence.

The source code for this R notebook (Rmd suffixed files), when stored as web pages (html files), can be downloaded by clicking the button at the top of the page.

If viewing the source code in RStudio, try executing each R "chunk" by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*.


**Warning. Typos are *Legion*!**

# 1. Introduction

When you're in [MATH 381 (Intro to Probability and Stats)](http://ecatalog.sdsmt.edu/preview_course_nopop.php?catoid=17&coid=26571) you'll get a taste of R.  R is an open-source statistical package built off of an earlier generation of commercial software.

The goal here is to demonstrate cracking open an Excel spreadsheet in R, calculating some basic stats, creating various plots to view the statistics, and finally, doing some linear and multivariate regression.

Another goal here is to show off some of R's features.  R is a very powerful tool.  When translating "powerful" from computerese to any frustrated human dialect, that means "steep learning curve."  It's also a community-supported environment.  When translating "community-supported" from computerese to any overscheduled human dialect, that means "there are LOTS of people donating packages and libraries to R."  Some have evolved to be a standard in the community.  Others are highly specialized for a given discipline (but have one or two items that people outside their user communities find handy).

But don't let that intimidate you.  Once you learn one language you can slowly pick up more.  Also, this demo isn't going to turn you into an R guru in a day.

If you want a good stepping off point to learn R I'd recommend some of the resources at [Data Camp](https://www.datacamp.com/courses/free-introduction-to-r) which have some free starter tutorials for R.


# 2. Loading the Libraries

To work with R we will first have to load some libraries.  This is like in C where you have the #include statement to do things like raise things to powers and stuff like that.

Some of these libraries or "packages" come with R.  Others will have to be installed.  Here are the ones we are using for this exercise.

Also in this exercise, we're going to use the [tidyverse](https://www.tidyverse.org) set of packages.  Tidyverse is a set of co-developed tools for data science in R.  This is the new big thing in R and is widely used so we are just going to jump in here.  SD Mines has a course beyond Engineering Stats, [MATH 443/543 (Data Analysis)](http://ecatalog.sdsmt.edu/preview_course_nopop.php?catoid=17&coid=26973) that leverages this set of packages.

* Install Us First
  + [tidyverse](https://www.tidyverse.org) : Set of commonly-used Data Science packages for R that can be installed and loaded all at once. In the long run you will probably want the whole tidyverse suite installed anyway. For this exercise this will include...   
    - [ggplot2](https://ggplot2.tidyverse.org) : Create Elegant Data Visualizations Using the Grammar of Graphics
    - [tibble](https://tibble.tidyverse.org) : Simple Data Frames
    - [tidyr](https://tidyr.tidyverse.org) : Tools for shepherding data in data frames.
    - [readr](https://readr.tidyverse.org) : Read Rectangular Text Data
    - [purrr](https://purrr.tidyverse.org) : Functional Programming Tools
    - [dplyr](https://dplyr.tidyverse.org) : A grammar of data manipulation
    - [stringr](https://stringr.tidyverse.org) : Simple, Consistent Wrappers for Common String Operations
    - [forcats](https://forcats.tidyverse.org) : Tools for Working with Categorical Variables (Factors)

  + [readxl](https://www.rdocumentation.org/packages/readxl/versions/1.1.0) : also part of the [tidyverse](https://www.tidyverse.org) package suite for reading traditional excel spreadsheets.  
  + [moderndive](https://cran.r-project.org/package=moderndive) : Tidyverse-Friendly Introductory Linear Regression

  
* This should come with R's core install, if not install 'em.
  + [MASS](https://www.rdocumentation.org/packages/MASS/versions/7.3-50) : Has a lot of resources for regression.

* This doesn't come with R's core install so install that one... 
  + [moments](https://www.rdocumentation.org/packages/moments/versions/0.14) : This has a load of good stuff for data analysis and plotting, more than you will need here, but get it anyway.

* This is a nice contributed library that lets us make pretty statistics tables.  It was written for ecological applications but it's still pretty handy for looking at concrete
  + [pastecs](https://www.rdocumentation.org/packages/pastecs/versions/1.3.21): Package for Analysis of Space-Time Ecological Series
  
* Another nice contributed library that makes matrices of correlation coefficients look pretty (and  graphically informative).
  + [corrplot](https://www.rdocumentation.org/packages/corrplot/versions/0.84) Visualization of a Correlation Matrix

* While not officially needed for this activity, I'll demonstrate how units can be used in R in this example
  + [udunits2](https://www.rdocumentation.org/packages/udunits2/versions/0.13) Provides simple bindings to Unidata's udunits library for unit conversions (will be demonstrating but not explicitly needing it here)
  + [units](https://www.rdocumentation.org/packages/units/versions/0.6-0) Provides Measurement Units for R Vectors

```{r}

  # Tidyverse Handling Libraries

    
  library(package = "tidyverse")  # main tidyverse suite
  library(package = "readxl")     # Read Excel Files
  library(package = "moderndive") # regression support

  # Statistics Libraries

  library(package = "moments")   # Moments, cumulants, skewness, kurtosis and related tests
  library(package = "MASS")      # Support Functions and Datasets for Venables & Ripley's MASS text

  # Extra Graphics Libraries

  library(package = "corrplot")  # Visualization of a Correlation Matrix


  # Data Processing Libraries

  library(package = "pastecs")   # Package for Analysis of Space-Time Ecological Series

  library(package = "udunits2")  # Unit Conversion Support
  library(package = "units")     # Measurement Units for R Vectors

```
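If any of those library() calls complain that a package isn't there, it needs to be installed first. Here's a minimal sketch of one way to find out what is missing before you start. The helper missing_packages() is a made-up name for this illustration, and the install line is left commented out so nothing gets installed behind your back.

```r
# Packages used in this exercise
needed_packages <- c("tidyverse", "readxl", "moderndive",
                     "moments",   "MASS",   "pastecs",
                     "corrplot",  "udunits2", "units")

# Hypothetical helper: return the packages in a list that are not installed
missing_packages <- function(packages) {
  is_installed <- vapply(X         = packages,
                         FUN       = requireNamespace,
                         FUN.VALUE = logical(1),
                         quietly   = TRUE)
  packages[!is_installed]
}

to_install <- missing_packages(packages = needed_packages)
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```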

# 3. Cracking a Spreadsheet

The spreadsheet example below is more complicated than what you hopefully have.

The original data set is from a set of papers on Concrete by I-Cheng Yeh 

* [Yeh, I-Cheng, "Modeling slump of concrete with fly ash and superplasticizer," *Computers and Concrete*, **5**(6), 559-572, 2008. doi: 10.12989/cac.2008.5.6.559.](http://www.techno-press.org/content/?page=article&journal=cac&volume=5&num=6&ordernum=4)

* [Yeh, I-Cheng, "Simulation of concrete slump using neural networks," *Construction Materials*, **162**(1), 11-18, 2009. doi: 10.1680/coma.2009.162.1.11](https://www.icevirtuallibrary.com/doi/10.1680/coma.2009.162.1.11)

* [Yeh, I-Cheng, "Prediction of workability of concrete using design of experiments for mixtures," *Computers and Concrete*, **5**(1), 1-20, 2008. doi: 10.12989/cac.2008.5.1.001](http://www.techno-press.org/content/?page=article&journal=cac&volume=5&num=1&ordernum=1)

* [Yeh, I-Cheng, "Modeling slump flow of concrete using second-order regressions and artificial neural networks," *Cement and Concrete Composites*, **29**(6), 474-480, 2007. doi: 10.1016/j.cemconcomp.2007.02.001](https://www.sciencedirect.com/science/article/pii/S0958946507000261?via%3Dihub)

* [Yeh, I-Cheng, "Exploring concrete slump model using artificial neural networks," *ASCE J. of Computing in Civil Engineering*, **20**(3), 217-221, 2006. doi: 10.1061/(ASCE)0887-3801(2006)20:3(217)](https://ascelibrary.org/doi/10.1061/%28ASCE%290887-3801%282006%2920%3A3%28217%29)

and is kept at the [UC-Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test).



It can be found here at [http://kyrill.ias.sdsmt.edu/cee_284/Base_Concrete_Slump_Test_for_R.xlsx](http://kyrill.ias.sdsmt.edu/cee_284/Base_Concrete_Slump_Test_for_R.xlsx)

The relevant page and screenshot is below.  For drama-free R import you are probably best off keeping a page on your spreadsheet file that is very simple, with numbers going down, and a single line for Row-1 with the headers of each column.  If you want to get fancy on other pages that you'd turn in as tables in reports, you can do that on another spreadsheet page.

![Concrete Spreadsheet Screenshot](http://kyrill.ias.sdsmt.edu/wjc/eduresources/Base_Concrete_Slump_Test_for_R.png)

To crack open the spreadsheet we will want to use the [read_excel](https://www.rdocumentation.org/packages/readxl/versions/1.1.0/topics/read_excel) function.

You can read the spreadsheet from a local drive or from a website.

```{r}

  # you will need the full path to the file you are using (either online or locally on your disk)

  # The if else block should query your machine to determine which operating system.
  #  if you are not bi-platform, you likely don't need this.

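  if(.Platform$OS.type == "windows") {
    # Windows (R does not expand %HOMEPATH% inside a string, so ask for the profile folder)
    spreadsheet_name     = file.path(Sys.getenv("USERPROFILE"),
                                     "Downloads",
                                     "Base_Concrete_Slump_Test_for_R.xlsx")
  } else {
    # Unix (Linux, MacOS, Solaris)
    spreadsheet_name     = path.expand("~/Downloads/Base_Concrete_Slump_Test_for_R.xlsx")
  }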


  # I am keeping a copy of this spreadsheet at the URL below.  It can be downloaded automatically
  #   and then loaded.  We can also discreetly delete it when done.

      spreadsheet_url = "http://kyrill.ias.sdsmt.edu/wjc/eduresources/Base_Concrete_Slump_Test_for_R.xlsx"
   
      download.file(url      =   spreadsheet_url, # URL location
                    destfile = spreadsheet_name,  # local downloaded location
                    mode     =             "wb")  # binary mode keeps the xlsx intact on Windows
      
      remove(spreadsheet_url) # clean up variables
  
  # this command will read the file

  concrete = read_excel(path      = spreadsheet_name,  # local spreadsheet location
                        sheet     = "Data",            # page of spreadsheet
                        col_names = TRUE)              # first row are the column headers
  
  
  # clean up your hard drive!  Don't be like me!
  # file.remove() works the same way on Windows, Linux and MacOS,
  # so no operating-system check is needed here.

  file.remove(spreadsheet_name)
  
  remove(spreadsheet_name) # clean up variables

  
```
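If you'd rather not touch your Downloads folder at all, a temporary file does the same job and tidies up after itself. This is a minimal sketch of that alternative, not what the chunk above does: read_remote_file() is a hypothetical helper name, and it takes whatever reading function you hand it (read_excel, for instance).

```r
# Hypothetical helper: download to a temporary file, read it, then clean up
read_remote_file <- function(url, reader, ...) {

  # temporary file that keeps the same extension as the remote file
  temp_path <- tempfile(fileext = paste0(".", tools::file_ext(url)))

  # delete the temporary file when the function exits, even on an error
  on.exit(unlink(temp_path), add = TRUE)

  download.file(url      = url,
                destfile = temp_path,
                mode     = "wb",   # binary mode for spreadsheets
                quiet    = TRUE)

  reader(temp_path, ...)
}

# Usage (commented out: needs an internet connection and the readxl package)
# concrete = read_remote_file(
#   url       = "http://kyrill.ias.sdsmt.edu/wjc/eduresources/Base_Concrete_Slump_Test_for_R.xlsx",
#   reader    = readxl::read_excel,
#   sheet     = "Data",
#   col_names = TRUE)
```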

With the data read in we can now look at the table of the data.  This looks much nicer when working in R Notebooks instead of Plain Ordinary R.

```{r}

  # Print data frame
  colnames(concrete)[1] = "Test_Number"
  print(concrete)

```
### Extra: Units (not part of this exercise but it's a nifty tangent)

Dang.  I like units. I don't see any.  I'm anal and have learned that adding as much descriptive data early on in processing your data set will make people (and most importantly, yourself) not hate you at a later date.  So I am adding them here with the [set_units](https://www.rdocumentation.org/packages/units/versions/0.6-0/topics/set_units) function. This will add units as an attribute.

Units don't work with everything and you should probably keep a copy of your original un-unitted data frame.  


```{r}

# first we clone our data frame

concrete_units = concrete

concrete_units$Cement                    = set_units(x     = concrete_units$Cement, 
                                                     value = "kg m-3")

concrete_units$Slag                      = set_units(x     = concrete_units$Slag, 
                                                     value = "kg m-3")

concrete_units$Fly_Ash                   = set_units(x     = concrete_units$Fly_Ash, 
                                                     value = "kg m-3")

concrete_units$Water                     = set_units(x     = concrete_units$Water, 
                                                     value = "kg m-3")

concrete_units$Superplasticizer          = set_units(x     = concrete_units$Superplasticizer, 
                                                     value = "kg m-3")

concrete_units$Coarse_Aggregates         = set_units(x     = concrete_units$Coarse_Aggregates, 
                                                     value = "kg m-3")

concrete_units$Fine_Aggregates           = set_units(x     = concrete_units$Fine_Aggregates, 
                                                     value = "kg m-3")

concrete_units$Slump                     = set_units(x     = concrete_units$Slump, 
                                                     value = "cm")

concrete_units$Flow                      = set_units(x     = concrete_units$Flow, 
                                                     value = "cm")

concrete_units$Compressive_Strength_28dy = set_units(x     = concrete_units$Compressive_Strength_28dy, 
                                                     value = "MPa")

print(concrete_units)


```

If you click in the Global Environment Box, those units aren't arbitrary strings. They are listed as numerators, denominators and also the way in which squares, etc., are archived are explicit.

Better still, the same set_units command, when applied to a variable that already has units, will convert it.  This is nice when moving between SI units and USCS units.  [If you are going to be cheeky and try the Furlong/Firkin/Fortnight system (FFF), sorry to disappoint: while the udunits2 package in R recognizes all three units, it recognizes firkins as a volume measure (which it really is) and not the mass measure based on the density of water.]

Example here:

```{r}

  # a little unit-fu™️ play!

  strength_in_psi = set_units(x     = concrete_units$Compressive_Strength_28dy,
                              value = "psi")

  print(concrete_units$Compressive_Strength_28dy[1])
  print(strength_in_psi[1])
  
  # Ok now I'm being silly but so were the package developers.  
  # Blame them.  
  # (Once again, I can't do official FFF units)

  cement_in_slug_per_cu3 = set_units(x     = concrete_units$Cement,
                                     value = "slugs/furlongs^3")
  
  print(concrete_units$Cement[1])
  print(cement_in_slug_per_cu3[1])
  
  
  # cleaning-up our horseplay..
  
  remove(strength_in_psi)
  remove(cement_in_slug_per_cu3)
  
  remove(concrete_units)
  
```
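There's no magic in the conversion itself. For the MPa-to-psi case it boils down to one scaling factor, which you can sanity-check by hand in plain R. A minimal sketch (mpa_to_psi() is a made-up helper for illustration, not part of the units package):

```r
pa_per_mpa <- 1e6          # 1 MPa = 1,000,000 Pa
pa_per_psi <- 6894.757293  # 1 psi (pound-force per square inch) in Pa

# Hypothetical helper: convert strengths from MPa to psi
mpa_to_psi <- function(strength_mpa) {
  strength_mpa * pa_per_mpa / pa_per_psi
}

print(mpa_to_psi(strength_mpa = c(1, 30)))
```

One MPa works out to a bit over 145 psi, so any strength that set_units() hands back in psi should be about 145 times its MPa value.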
Caveat!  As useful as this can be, know this:  Not all R functions play nice with units or other "attributes" in data frames.  Some of the plotting routines and linear regression routines below may not work with units attached.

If you need your units but want to minimize "messy" code when they conflict with a given function, you can strip the units out later by using the [as.numeric()](https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/numeric) function.


# 4. Some Basic Statistics and Traditional Single Variable Plots

Let's start with some basic statistics and plotting of them.

## 4.1. The "classic" stats

Let's get the mom-and-apple-pie stats for the cement in our concrete data.
The second argument in each call (na.rm) allows you to deal with missing data.



```{r}

  # statistics for cement


  print(str_c("    Mean Cement : ",
              mean(x     = concrete$Cement, # variable to crunch
                   na.rm =            TRUE) # ignore missing data
              ))

  print(str_c("   Stdev Cement : ",
              sd(x     = concrete$Cement, # variable to crunch
                 na.rm =            TRUE) # ignore missing data
              ))
  
  print(str_c("Skewness Cement : ",
              skewness(x     = concrete$Cement, # variable to crunch
                       na.rm =            TRUE) # ignore missing data
              ))
  
  print(str_c("Kurtosis Cement : ",
              kurtosis(x     = concrete$Cement, # variable to crunch
                       na.rm =            TRUE) # ignore missing data
              ))
     
```
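If you're curious what skewness() and kurtosis() are doing, both are ratios of central moments. The sketch below works them out by hand on a toy vector (made-up numbers, not the concrete data). As far as I know this matches the definitions the moments package uses, including the fact that its kurtosis is *not* the "excess" kurtosis, so a normal distribution comes out near 3 rather than 0. The helper central_moment() is my own name.

```r
x <- c(1, 2, 3, 4, 5)  # toy data, symmetric about its mean of 3

# Hypothetical helper: the k-th central moment (dividing by N, not N-1)
central_moment <- function(x, order) {
  mean((x - mean(x))^order)
}

# skewness = m3 / m2^(3/2)
skewness_by_hand <- central_moment(x, 3) / central_moment(x, 2)^(3/2)

# kurtosis = m4 / m2^2  (not "excess" kurtosis)
kurtosis_by_hand <- central_moment(x, 4) / central_moment(x, 2)^2

print(skewness_by_hand)
print(kurtosis_by_hand)
```

Because the toy data are symmetric, the skewness comes out as zero.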

OK this is a little clunky.  It would be nice if someone somewhere made a support library for R that will make nice tables of statistics.

In this case Vive La France! A team from the French Research Institute for Exploitation of the Sea asked the same question and, as is often the case in the R community, not only drafted a set of tools to do this *but also* made it public.

Here we are using their [stat.desc](https://www.rdocumentation.org/packages/pastecs/versions/1.3.21/topics/stat.desc) function.

This will hopefully give people wanting to make basic tables "maximum satisfaction with minimal effort."

```{r}

  # Plot a statistics table -- all the classics nice and handy and pretty.

  options(digits=2) # this sets the number of significant digits printed (it stays in effect for later chunks too)
                    # this particular function creates the table in scientific notation
  
  concrete_statistics = stat.desc(x    = concrete,  # data frame
                                  basic =    TRUE,  # includes counts and extremes 
                                  desc =     TRUE,  # include classic stats (mean etc)
                                  norm =     TRUE,  # include normal dist stats (skewness etc)
                                  p    =     0.95)  # use 95% confidence limits


  print(concrete_statistics)

```
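One row in that table deserves a note. As I understand it, the CI.mean row is the half-width of the confidence interval on the mean (so the interval is the mean plus or minus that number), built from the standard error of the mean and the t-distribution. A minimal by-hand sketch on a toy vector (made-up numbers, illustrative variable names):

```r
x <- c(12, 15, 11, 18, 14, 16)  # toy data (made-up values)
n <- length(x)

se_mean       <- sd(x) / sqrt(n)                    # standard error of the mean
ci_half_width <- qt(p = 0.975, df = n - 1) * se_mean # 95% two-sided half-width

ci_by_hand <- c(lower = mean(x) - ci_half_width,
                upper = mean(x) + ci_half_width)

print(ci_by_hand)

# base R's t.test() reports the same interval
print(t.test(x)$conf.int)
```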



## 4.2. Reorganizing Your Data to Handle Multiple Variables at Once

To leverage some of R's niftier features we will need to reorganize our data from a "spreadsheet style" (wide) format into what some people call a "long form" table. The column headers of our concrete traits become the entries of a single column, and the values from those columns are stacked into a second column, similar to the graphic below.


![Example of the Gather Function](https://jules32.github.io/2016-07-12-Oxford/dplyr_tidyr/img/rstudio-cheatsheet-reshaping-data-gather.png)

This is done with the function [gather()](https://www.rdocumentation.org/packages/tidyr/versions/0.8.1/topics/gather)


```{r}

  # Gathering our components into a single column.

  # We just want the names of our components here so we get everything past
  # the first column (which is the experiment name)

  column_names  = colnames(concrete[2:ncol(concrete)])   

  as_tibble(column_names) # as_tibble makes it look pretty when printed

  # the gather command will stack all of the named columns into
  # a single "Parameter" column and a single "Value" column

  concrete_tidy = gather(data  =    concrete, # your data frame
                         key   = "Parameter", # column name for your former columns
                         value =     "Value", # column name for your data
                         all_of(column_names)) # the list of the columns to "gather"

  

  # this will let us sort future plots in the same order as our original columns.  
  
  concrete_tidy$Parameter = factor(x      = concrete_tidy$Parameter,
                                   levels = column_names)
  
  # we can also split things between our dependent variables and independent variables.
  
  
  concrete_independent = subset(x      = concrete_tidy,
                                subset = (Parameter != "Slump") &
                                         (Parameter != "Flow")  &
                                         (Parameter != "Compressive_Strength_28dy")
                                ) 
    
    
  concrete_dependent = subset(x      = concrete_tidy,
                              subset = (Parameter == "Slump") |
                                       (Parameter == "Flow")  |
                                       (Parameter == "Compressive_Strength_28dy")
                              )

 
                       

  print(concrete_tidy)
  print(concrete_independent)
  print(concrete_dependent)
  
  
```
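A heads-up if you go looking through the tidyr documentation: the newer function for this same wide-to-long reshaping is pivot_longer(). Here is a minimal sketch of it on a tiny made-up data frame (the numbers in toy are invented for illustration and are not rows from the Yeh data set):

```r
library(package = "tidyr")

# Tiny made-up "spreadsheet style" (wide) table
toy <- data.frame(Test_Number = c(1, 2),
                  Cement      = c(270, 160),
                  Water       = c(210, 180))

# Stack the Cement and Water columns into Parameter / Value pairs
toy_long <- pivot_longer(data      = toy,
                         cols      = all_of(c("Cement", "Water")), # columns to stack
                         names_to  = "Parameter",                  # former column names
                         values_to = "Value")                      # former cell values

print(toy_long)
```

Each test now takes up two rows, one per parameter, which is the same layout as concrete_tidy above.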





# 5. Plotting Graphics using Tidyverse Resources

R has a few ways to do the basic histograms, Boxplots and other distribution plots.

There are a number of spiffy ways to plot these statistical plots in R. We're just using one here...

## 5.1.  SLOOOOWWWWLLLLLYYY Making a Simple Plot (Histogram Edition)

Now I'm going to do this one tiny step at a time until we get to a viable product.  (This is how I work through cryptic procedures so I can see what each little additional mystery thingie does.)

Graphing is handled by the [ggplot2](https://ggplot2.tidyverse.org) package... which has a heluvalot under its hood!  For me all that detail was what had me a little shy to adopt this way of plotting data.

Tidyverse uses what is sometimes called the ["grammar of graphics"](https://ramnathv.github.io/pycon2014-r/visualize/ggplot2.html) method... to make a long story longer, the GoG presents separate commands to do separate things rather than bundle stuff in a single graphing function.  Sometimes it makes a lot of sense... other times it may be confusing.  (Hence me demonstrating making a graph this one tiny step at a time!)


First thing we are going to do is open a plotting space with the command [ggplot()](https://ggplot2.tidyverse.org/reference/ggplot.html)

```{r}

# invoke the ggplot plotting environment.

ggplot() 

```

Wow.  We have a... big square of... grey.  All it's doing is setting up our plot environment... so let's do some more...

If we want to do a histogram we are going to have to tell it what we want to print and where to get the stuff

When we add things to a plot command in Tidyverse we "add" to the steps incrementally.

This involves a "mapping" function called "[aes](https://ggplot2.tidyverse.org/reference/aes.html)" (short for aesthetics)

Here, we are working with the data frame "concrete" and the variable Cement, which we are tossing onto the x axis because that's where the bins of cement go!

```{r}

ggplot(data = concrete) +   # EDIT:  invoke graphics environment using a given dataframe
  
  aes(x    = Cement)        # NEW: select variable to print... You can get really fancy here later

```

OK now we have something that looks like we may have the making of the graph.  If you don't like grey outlines and white grids, no worries, we can change that shortly.

OK.. we are now ready to make a histogram... 

Here we will use one of ggplot2's "geom_*" (draw stuff) resources.  The default should work for us here.

```{r}

ggplot(data = concrete) +   # invoke graphics environment using a given dataframe
  
  aes(x = Cement)   +       # select variable to print... You can get really fancy here later

  geom_histogram()          # NEW: insert histogram

```

(You may have gotten a message that the histogram is using `bins = 30`.  That is just the default; you can pick a better value with the `bins` or `binwidth` arguments.)
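Here is a minimal sketch of setting the bins yourself. It uses a small made-up vector of cement amounts (invented for illustration, not the Yeh data) so that it runs on its own, and the 25 kg m^-3^ bin width is an arbitrary choice for this toy example.

```r
library(package = "ggplot2")

# Made-up cement amounts, for illustration only
toy <- data.frame(Cement = c(137, 140, 145, 150, 160, 162,
                             200, 250, 300, 310, 350, 374))

histogram_plot <- ggplot(data = toy) +       # invoke graphics environment
  
  aes(x = Cement) +                          # select variable to plot
  
  geom_histogram(binwidth = 25,              # each bin is 25 units wide...
                 boundary = 100)             # ...with bin edges lined up on 100

print(histogram_plot)
```

Setting `binwidth` (or `bins`) explicitly also makes the message go away.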

Now quickly before moving on... I am not keen on  the grey background with white lines.  

There are a number of out-of-the-box ["themes"](https://ggplot2.tidyverse.org/reference/ggtheme.html) for ggplot2.  

I'm partial to theme_bw() and theme_light() but try the ones that you prefer or stick with the default, theme_gray().  

These plots shown here are mine.  You should fidget about so they are *yours* and so you can adapt to this new way of working with data.



```{r}

ggplot(data = concrete) + # invoke graphics environment using a given dataframe
  
  theme_bw() +            # NEW: changing the plotting theme
  
  aes(x = Cement) +       # select variable to print... You can get really fancy here later

  geom_histogram()        # insert histogram

```

My OCD hates axes where the labels don't envelop all of the data... 

We can fix that with [xlim() or ylim()](https://ggplot2.tidyverse.org/reference/lims.html)

```{r}

ggplot(data = concrete) +     # invoke graphics environment using a given dataframe
  
  theme_bw() +                # changing the plotting theme
  
  aes(x = Cement) +           # select variable to print... You can get really fancy here later
  
  xlim( 100, 400 ) +          # NEW: adding x-axis limits

  geom_histogram()            # insert histogram

```
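One thing to watch: xlim() throws away any observations that fall outside the limits before the histogram is counted (ggplot2 will warn you about removed rows).  If you only want to zoom the view, coord_cartesian() is the usual alternative.  Here is a sketch with made-up numbers (`toy_mix` is a placeholder, not the Yeh data):

```{r}
library(ggplot2)

# made-up values; 90 and 420 sit outside the 100-400 window
toy_mix <- data.frame(Cement = c(90, 140, 150, 155, 160, 210, 230, 250, 300, 420))

# xlim() discards the 90 and 420 values before binning (expect a warning)
hist_clipped <- ggplot(data = toy_mix) +
  aes(x = Cement) +
  xlim(100, 400) +
  geom_histogram(binwidth = 25)

# coord_cartesian() keeps all of the data and only zooms the window
hist_zoomed <- ggplot(data = toy_mix) +
  aes(x = Cement) +
  coord_cartesian(xlim = c(100, 400)) +
  geom_histogram(binwidth = 25)

hist_clipped
hist_zoomed
```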

How about changing the color of the fill in the bars...

[You really don't want to know about all the colors you can use.](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf)

```{r}


ggplot(data = concrete) +     # invoke graphics environment using a given dataframe
  
  theme_bw() +                # changing the plotting theme
  
  aes(x = Cement) +           # select variable to print... You can get really fancy here later
  
  xlim( 100, 400 ) +          # adding x-axis limits

  geom_histogram(fill="gray") # EDIT: insert histogram (with a single chosen color)

```

Want to customize the labels and titles so we can have units?

You can add custom labels and titles with [ggtitle(), xlab(), and ylab()](https://ggplot2.tidyverse.org/reference/labs.html)!

For the superscripting in the x-axis label, I am using the [expression()](http://vis.supstat.com/2013/04/mathematical-annotation-in-r/) tool in R.

```{r}


ggplot(data = concrete) +     # invoke graphics environment using a given dataframe
  
  theme_bw() +                # changing the plotting theme
  
  aes(x = Cement) +           # select variable to print... You can get really fancy here later
  
  xlim( 100, 400 ) +          # adding x-axis limits

  ggtitle("Yeh Superplasticizer Tests") +          # NEW : Custom Title
  
  xlab(expression('Cement Amount (kg m'^-3*")")) + # NEW : Custom Axis Label

  geom_histogram(fill="gray") # insert histogram (with a single chosen color)

```
And I could keep tweaking this graph all day, but good enough is good enough so this is a good place to stop... 

We also can plot a few other fields with some trial and error.. 

```{r}

# Histogram of Water

ggplot(data = concrete) +     # invoke graphics environment using a given dataframe
  
  theme_bw() +                # changing the plotting theme
  
  aes(x = Water) +           # select variable to print... You can get really fancy here later
  
  xlim( 150, 250 ) +          # adding x-axis limits

  ggtitle("Yeh Superplasticizer Tests") + #Custom Title
  
  xlab(expression('Water Amount (kg m'^-3*")")) + # NEW : Custom Axis Label note use of superscripts from above

  geom_histogram(fill="blue") # insert histogram (with a single chosen color)

```

```{r}

# Histogram of Strength

ggplot(data = concrete) +     # invoke graphics environment using a given dataframe
  
  theme_bw() +                # changing the plotting theme
  
  aes(x = Compressive_Strength_28dy) + # select variable to print... You can get really fancy here later
  
  xlim( 10, 60 ) +          # adding x-axis limits

  ggtitle("Yeh Superplasticizer Tests") + #Custom Title
  
  xlab("28-dy Compressive Strength (MPa)") + # NEW : Custom Axis Label

  geom_histogram(fill="red") # insert histogram (with a single chosen color)

```

(And from our Intro to Stats Lecture...)


```{r}

# Histogram of Slump

ggplot(data = concrete) +     # invoke graphics environment using a given dataframe
  
  theme_bw() +                # changing the plotting theme
  
  aes(x = Slump) + # select variable to print... You can get really fancy here later
  
  xlim( 0, 30 ) +          # adding x-axis limits

  ggtitle("Yeh Superplasticizer Tests") + #Custom Title
  
  xlab("Slump (cm)") + # NEW : Custom Axis Label

  geom_histogram(fill="darkgreen") # insert histogram (with a single chosen color)

```

## 5.2. Distribution Plot [not so good an] Example

There are some other plots that we can use to describe our data.

Here to play with them we will take a quick step back and address that "tidy"'ed (should that say "tidied"?) dataframe "concrete_tidy"

We can now use all the parameters in the "tidy" (long) data frame to print by specific traits.


```{r}

ggplot(data = concrete_tidy) +            # invoke graphics environment using a given dataframe
  
  theme_bw() +                            # changing the plotting theme
  
  aes(x      = Value,                     # map x-axis value
      color  = Parameter) +               # map colors to the different parameters
  
  ggtitle("Yeh Superplasticizer Tests") + # Custom Title
  
  xlab("Value") +                         #  Custom Axis Label

  geom_density()                          # NEW: create a relative density plot 

```

In the past, I've gotten good results with this but in this case, I think it's too messy in part due to the disparity in the dynamic range of our parameters. 
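One possible way around the dynamic-range problem (an aside, not something used elsewhere in this notebook) is to give every parameter its own panel with its own axes using facet_wrap().  The sketch below uses a made-up long data frame that borrows the `Parameter` and `Value` column names; the numbers are invented.

```{r}
library(ggplot2)

# made-up long ("tidy") data: two parameters with very different ranges
toy_tidy <- data.frame(
  Parameter = rep(c("Cement", "Superplasticizer"), each = 6),
  Value     = c(150, 210, 250, 300, 320, 350,    # hypothetical amounts
                4.5, 6.0, 7.5, 9.0, 11.0, 13.0)  # hypothetical amounts
)

density_by_panel <- ggplot(data = toy_tidy) +
  theme_bw() +
  aes(x     = Value,
      color = Parameter) +
  geom_density() +
  facet_wrap(~ Parameter, scales = "free")  # one panel per parameter, free axes

density_by_panel
```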

## 5.3. Box-Whisker Plot Example

How about leveraging a box whisker?  (I'm using only the independent variables this time.)


```{r}

ggplot(data = concrete_independent) +      # EDIT Changing dataframe
  
  theme_bw( ) +                            # changing the plotting theme
  
  theme(axis.text.x = element_blank()) +   # adding an extra trait to the x-axis
                                           # to not print labels on the x-axis 
                                           # (the labels overlap and don't look
                                           # pretty...)
  
  aes(y      = Value,                     # map y-axis value
      x      = Parameter,                 # map x-axis value
      color  = Parameter) +               # map colors to the different parameters
  
  ggtitle(label    = "Yeh Superplasticizer Tests",
          subtitle = "Concrete Test Components") + # Custom Title
  
  ylab(expression('Amount (kg m'^-3*")")) + # EDIT : Changing Custom Axis Label

  geom_boxplot()                          # NEW: create a box-whisker plot 

```

What about our dependent variables?  We can start by changing the data frame...

```{r}

ggplot(data = concrete_dependent) +      # EDIT Changing dataframe
  
  theme_bw( ) +                            # changing the plotting theme
  
  theme(axis.text.x = element_blank()) +   # adding an extra trait to the x-axis
                                           # to not print labels on the x-axis 
                                           # (the labels overlap and don't look
                                           # pretty...)
  
  aes(y      = Value,                     # map y-axis value
      x      = Parameter,                 # map x-axis value
      color  = Parameter) +               # map colors to the different parameters
  
  ggtitle(label    = "Yeh Superplasticizer Tests",
          subtitle = "Concrete Test Results") + # Custom Title
  
  ylab("Values") +

  geom_boxplot()                          # create a box-whisker plot 

```

Want units?  That's a little tougher here since the units differ by parameter.  We can force the legend entries to take on new names, though.

```{r}

ggplot(data = concrete_dependent) +      # EDIT Changing dataframe
  
  theme_bw( ) +                            # changing the plotting theme
  
  theme(axis.text.x = element_blank()) +   # adding an extra trait to the x-axis
                                           # to not print labels on the x-axis 
                                           # (the labels overlap and don't look
                                           # pretty...)
  
  aes(y      = Value,                     # map y-axis value
      x      = Parameter,                 # map x-axis value
      color  = Parameter) +               # map colors to the different parameters
  
  ggtitle(label    = "Yeh Superplasticizer Tests",
          subtitle = "Concrete Test Results") + # Custom Title
  
  ylab("Values") +

  # NEW: It says scale color but "color" is how we are distinguishing
  #      our boxplots (as seen in the mapping/aes command)
  #      we can then use the same plot order above to rewrite the labels
  #      (likewise we could change the plot order and of course the colors.)
  scale_color_discrete(labels = c("Slump (cm)",
                                  "Flow (cm)", 
                                  "28-dy Compressive Strength (MPa)")) + 
  
  geom_boxplot() # create a box-whisker plot 
  

```
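A word of caution on the label order: a discrete color scale lists its entries in the order of the factor levels, and a plain text column gets sorted alphabetically when it is turned into a factor.  The quick check below uses a made-up `toy_parameter` vector (the names are assumed to resemble those in the tidy data frame).  If your own levels come out in a different order than the labels you typed, the legend would be mislabeled.

```{r}
# hypothetical parameter names, stored as plain text
toy_parameter <- c("Slump", "Flow", "Compressive_Strength_28dy", "Slump", "Flow")

# the order used for a plain text column (alphabetical)
levels(factor(toy_parameter))

# forcing the order you want the legend (and your labels) to follow
toy_ordered <- factor(toy_parameter,
                      levels = c("Slump", "Flow", "Compressive_Strength_28dy"))
levels(toy_ordered)
```

Running the same `levels(factor(...))` check on the `Parameter` column of your own data frame shows which order your labels need to be in.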

## 5.4. Violin Plot Example

How about leveraging a "violin" plot?  A violin plot's width swells in areas with more observations and contracts with sparser data so it is like looking at a probability distribution.


```{r}

ggplot(data = concrete_independent) +      # EDIT Changing dataframe
  
  theme_bw( ) +                            # changing the plotting theme
  
  theme(axis.text.x = element_blank()) +   # adding an extra trait to the x-axis
                                           # to not print labels on the x-axis 
                                           # (the labels overlap and don't look
                                           # pretty...)
  
  aes(y      = Value,                     # map y-axis value
      x      = Parameter,                 # map x-axis value
      color  = Parameter) +               # map colors to the different parameters
  
  ggtitle(label    = "Yeh Superplasticizer Tests",
          subtitle = "Concrete Test Components") + # Custom Title
  
  ylab(expression('Amount (kg m'^-3*")")) + #  Changing Custom Axis Label

  geom_violin(scale="width") # EDIT: change to a violin plot 
                             #   the width argument 
                             # gives every plot the same width
  

```

and for the dependent variables...

```{r}

ggplot(data = concrete_dependent) +      # EDIT Changing dataframe
  
  theme_bw( ) +                            # changing the plotting theme
  
  theme(axis.text.x = element_blank()) +   # adding an extra trait to the x-axis
                                           # to not print labels on the x-axis 
                                           # (the labels overlap and don't look
                                           # pretty...)
  
  aes(y      = Value,                     # map y-axis value
      x      = Parameter,                 # map x-axis value
      color  = Parameter) +               # map colors to the different parameters
  
  ggtitle(label    = "Yeh Superplasticizer Tests",
          subtitle = "Concrete Test Results") + # Custom Title
  
  ylab("Values") +

  # It says scale color but "color" is how we are distinguishing
  #      our violins (as seen in the mapping/aes command)
  #      we can then use the same plot order above to rewrite the labels
  #      (likewise we could change the plot order and of course the colors.)
  scale_color_discrete(labels = c("Slump (cm)",
                                  "Flow (cm)", 
                                  "28-dy Compressive Strength (MPa)")) + 
  

  geom_violin(scale="width") # EDIT: change to a violin plot 
                             #   the width argument 
                             # gives every plot the same width  

```

This is basically the above "density" plot but "looking down" as with a box plot.  Also, by default (`trim = TRUE`) the "violins" are truncated where we leave the range of the data points.

## 5.5. Stacked Column or Bar Plot Example

We also can do bar plots or stacked column plots.  The one produced here shows the combined components by test unit.

```{r}

ggplot(data = concrete_independent) +      # EDIT Changing dataframe
  
  theme_bw( ) +                            # changing the plotting theme

  
  aes(x     = Test_Number,
      y     = Value,
      fill  = Parameter) +               # map fill colors to the different parameters
  
  ggtitle(label    = "Yeh Superplasticizer Tests",
          subtitle = "Concrete Test Components") + # Custom Title
  
  ylab(expression('Amount (kg m'^-3*")")) + #  Changing Custom Axis Label

  geom_col(position = "stack",  # NEW: create a stacked column graph 
           width    = 1.0    )  # with no space between columns

```


# 6. Correlation of Variables

## 6.1. Correlating and then Fitting Cement to Compressive Strength

Let's start by doing a "simple" plot.  In this case I already know the answer, because the spreadsheet also has a table of how well our independent variables correlate against the dependent variables (e.g., Slump, Flow, or in our case Strength).  Cement correlates the best against Compressive Strength (OK, truth be told, it correlates the least badly).

We can actually do this with a correlate function, [cor()](https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/cor)...

To grab a value in the table "concrete" we call the data frame (concrete) and the variable name (Cement or Water vs Compressive_Strength_28dy), separating the frame and variable names by a $ sign.


```{r}

print("Cement vs Compressive Strength Correlation, r")

cor(x = concrete$Cement,                    # the x-value 
    y = concrete$Compressive_Strength_28dy, # the y-value
    method = "pearson"                      # method of correlation
    )

```
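If you are curious what cor() is doing under the hood, Pearson's r is just the covariance of the two variables scaled by both of their spreads.  Here is a minimal sketch that works it out by hand and checks it against cor(); the two `toy_` vectors are made-up numbers, not values from the spreadsheet.

```{r}
# made-up numbers standing in for two columns of the data frame
toy_cement   <- c(150, 210, 250, 300, 320, 350)
toy_strength <- c(28, 33, 31, 40, 38, 45)

# Pearson's r: summed cross-products over the root of the summed squares
r_by_hand <- sum((toy_cement - mean(toy_cement)) * (toy_strength - mean(toy_strength))) /
  sqrt(sum((toy_cement - mean(toy_cement))^2) * sum((toy_strength - mean(toy_strength))^2))

r_from_cor <- cor(x = toy_cement, y = toy_strength, method = "pearson")

all.equal(r_by_hand, r_from_cor)  # the two should agree
```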

or if you like to do everything at once...

```{r}

# calculate all correlation values against each other

correlation_matrix = cor(x      = concrete, # using our dataframe to correlate everything
                         method = "pearson" )

as_tibble(correlation_matrix, rownames = "Variable") # tbl_df() is deprecated; this also keeps the row names

```

Lots of numbers... not all that insightful on their own... 

You also can graph the look-n-feel of what all of the different correlations are... (it works best with a much smaller number of variables)

```{r}

# draw a correlation graphic...

corrplot(corr   = correlation_matrix,
         type   = "upper")

```

We can now see, for example, that cement, slag, and fly ash amounts have a nominal but not thrilling correlation to compressive strength, while water has a good correlation with the resulting slump values.  One thing that this does *not* show is how well these parameters play with other parameters.  As we'll see when all of our independent variables are working together, cement and water, followed by fly ash and coarse aggregates, together contribute the most of our independent parameters in calculating the compressive strength.

## 6.2. Scatter Plot Example

But for now, let's plot the Cement amount against Compressive Strength.

```{r}

# Making a simple X-Y scatterplot.

ggplot(data = concrete) +                # invoke graphics environment using a given dataframe
  
  theme_bw( ) +                           # changing the plotting theme
  
  aes(x      = Cement,                       # x-value
      y      = Compressive_Strength_28dy) +  # y-value

  ggtitle("Yeh Superplasticizer Tests") +    # Custom Title
  
  xlab(expression('Cement Amount (kg m'^-3*")")) +  # x-label
  ylab("28-dy Compressive Strength (MPa)")      +   # y-label

  geom_point(colour="grey")   # EDIT: plot points (the "colour" spelling was
                              #       written by an anglophile!)

```

Here's a cute trick:  Could we color those dots by a variable?

Sure!

```{r}

# Making a simple X-Y scatterplot now coloured by another parameter

ggplot(data = concrete) +                # invoke graphics environment using a given dataframe
  
  theme_bw( ) +                           # changing the plotting theme
  
  aes(x      = Cement,                       # x-value
      y      = Compressive_Strength_28dy,    # y-value
      color  = Superplasticizer)          +  # ADD: we can color by a variable!

  ggtitle("Yeh Superplasticizer Tests") +    # Custom Title
  
  xlab(expression('Cement Amount (kg m'^-3*")")) +  # x-label
  ylab("28-dy Compressive Strength (MPa)")      +   # y-label

  geom_point() +  # plot points 
  scale_color_distiller(palette = "Spectral") # NEW: pick a custom "colour" palette.

```

Love overkill?  Want to look, without any distinct numerical score, at how every variable in your data set correlates with every other variable?

Try [pairs()](https://www.rdocumentation.org/packages/graphics/versions/3.5.1/topics/pairs)

(I like the corrplot function better!)

```{r}

# way too many tiny plots!

pairs(x   = concrete, # do everything in the dataframe
      pch = ".")      # plot dots (the default is circles)

```

(Obviously the more variables in your dataframe the messier it gets!)



## 6.3. Creating our linear model and "calibrating" it

We weren't all that thrilled with the correlation between these components and strength, but let's go ahead and create a regression model from this anyway.

Here we will use the [lm()](https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/lm) (linear model) function from R's built-in stats package.

For the regression formula 

$\widehat{y}(x) = {\alpha_0}+{\alpha_1}\ x$

or

$\widehat{Strength}(cement) = {\alpha_0}+{\alpha_1}\ cement$

the "prototype" (formula) for the function is written as ... 

"Y ~ X" (with the y-intercept implicit in the formula... you don't put it in but it'll be there when you're done.)

The above syntax works like this....

Dependent Variable  [~ is a function of ] Independent Variable [and any other parameter you need gets added with a plus]

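If this were $\widehat{y}(x)={\alpha_0}+{\alpha_1}\ x^3$, then the prototype for the function would be y ~ I(x^3).  (Inside a formula, ^ has its own special meaning, so the I() wrapper tells R to treat x^3 as ordinary arithmetic.)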
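Powers inside a formula are worth a quick demonstration, because a bare ^ in an R formula is a formula operator and not arithmetic.  The sketch below fits made-up data (the `toy_x` and `toy_y` values are invented) three ways so you can compare the coefficients.

```{r}
# made-up data that follows y = 2 + 0.5 x^3 plus a little wobble
toy_x <- 1:8
toy_y <- 2 + 0.5 * toy_x^3 + c(0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3)

fit_plain <- lm(toy_y ~ toy_x)       # ordinary straight-line fit
fit_caret <- lm(toy_y ~ toy_x^3)     # bare ^ is a formula operator: same straight-line fit
fit_cubed <- lm(toy_y ~ I(toy_x^3))  # I() makes R cube x before fitting

coef(fit_plain)
coef(fit_caret)   # identical to fit_plain
coef(fit_cubed)   # slope lands close to the 0.5 we built in
```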

This will hopefully make more sense as we continue!

*(lm and similar linear regression functions don't play well with units.)*

```{r}

linear_model.S_v_c =  lm(formula = Compressive_Strength_28dy ~ Cement, # your formula y ~ x
                         data    = concrete)                           # the data frame
```

Let's see what we have...  This summary command will provide the details of the lm() function's important results.

For us, we want to see the Y-Intercept [the (Intercept) under "Estimate"] and the slope that goes with our independent variable ("Cement" under "Estimate").

The Standard Error of the Estimate is there (Residual Standard Error), as is the Coefficient of Determination (Multiple R-squared).

We'll talk about a few of the other features when we do the larger multivariate regression.

```{r}

 summary(object = linear_model.S_v_c)

```

In the above output, the asterisks flag the coefficients that are statistically significant (the more asterisks, the smaller that coefficient's p-value).  Here it's trivial, since we only have one independent variable, even though this is a terrible relationship between cement and strength.  Later we will use all of our available independent variables and these asterisks will become more important.
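If you want those numbers for later use instead of just reading them off the screen, you can pull them out of the model and its summary.  This is a small sketch on a made-up data frame (`toy_concrete` and its values are placeholders, built so the slope is close to 0.05).

```{r}
# made-up mixes (hypothetical values, not the Yeh data)
toy_concrete <- data.frame(Cement = seq(from = 150, to = 375, by = 25))
toy_concrete$Strength <- 20 + 0.05 * toy_concrete$Cement + rep(c(1, -1), times = 5)

toy_fit <- lm(formula = Strength ~ Cement, data = toy_concrete)

coef(toy_fit)                  # named vector: (Intercept) and Cement
summary(toy_fit)$coefficients  # table: Estimate, Std. Error, t value, Pr(>|t|)
summary(toy_fit)$sigma         # Residual Standard Error
summary(toy_fit)$r.squared     # Multiple R-squared
```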


Want to plot it?  

Good news?  

Like Excel, you have some automated features to give you quick satisfaction and happiness.  More still, it will give you confidence limits.

For this we use another ggplot2 layer called [geom_smooth()](https://ggplot2.tidyverse.org/reference/geom_smooth.html).

```{r}

# Making a simple X-Y scatterplot and adding a regression to it

ggplot(data = concrete) +                # invoke graphics environment using a given dataframe
  
  theme_bw( ) +                           # changing the plotting theme
  
  aes(x      = Cement,                       # x-value
      y      = Compressive_Strength_28dy) +  # y-value

  ggtitle("Yeh Superplasticizer Tests") +    # Custom Title
  
  xlab(expression('Cement Amount (kg m'^-3*")")) +   # x-label
  ylab("28-dy Compressive Strength (MPa)")      +   # y-label

  geom_point(colour="darkgrey") +  # plot points
  geom_smooth(method    = "lm",    # use a simple linear model
              formula   = y ~ x,   # lm-style formula
              se        = TRUE,    # display Confidence Intervals
              level     = 0.95,    # Confidence Level to Map Out
              colour    = "black", # regression line color
              linewidth = 0.5)     # line thickness (`size` before ggplot2 3.4.0)

```

The line here looks like a positive correlation between the cement amount and the resulting strength.
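The shaded band that geom_smooth() draws can also be worked out as numbers with predict() and `interval = "confidence"`.  Below is a minimal sketch on made-up data (`toy_concrete` and `new_mix` are placeholders, not the Yeh data).

```{r}
# made-up mixes (hypothetical values)
toy_concrete <- data.frame(Cement = seq(from = 150, to = 375, by = 25))
toy_concrete$Strength <- 20 + 0.05 * toy_concrete$Cement + rep(c(1, -1), times = 5)

toy_fit <- lm(formula = Strength ~ Cement, data = toy_concrete)

# cement amounts where we want the fitted line and its 95% band
new_mix <- data.frame(Cement = c(200, 300))

toy_band <- predict(object   = toy_fit,
                    newdata  = new_mix,
                    interval = "confidence",  # band for the mean response
                    level    = 0.95)

toy_band  # columns: fit (the line), lwr and upr (the band)
```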

Let's try water:


```{r}

# getting the linear model


linear_model.S_v_w =  lm(formula = Compressive_Strength_28dy ~ Water, # your formula y ~ x
                         data    = concrete   )                           # the data frame

summary(linear_model.S_v_w)
```

```{r}

# Making a simple X-Y scatterplot and adding a regression to it

ggplot(data = concrete) +                # invoke graphics environment using a given dataframe
  
  theme_bw( ) +                           # changing the plotting theme
  
  aes(x      = Water,                      # x-value
      y      = Compressive_Strength_28dy) +  # y-value

  ggtitle("Yeh Superplasticizer Tests") +    # Custom Title
  
  xlab(expression('Water Amount (kg m'^-3*")")) +  # x-label
  ylab("28-dy Compressive Strength (MPa)")      +   # y-label

  geom_point(colour="darkblue") +  # plot points
  
  geom_smooth(method    = "lm",    # use a simple linear model
              formula   = y ~ x,   # lm-style formula
              se        = TRUE,    # display Confidence Intervals
              level     = 0.95,    # Confidence Level to Map Out
              colour    = "blue",  # regression line color
              fill      = "cyan",  # NEW: fill for confidence limits
              linewidth = 0.5)     # line thickness (`size` before ggplot2 3.4.0)

```

Looking back at the correlation tables, none of the variables does a great job of explaining Compressive Strength on its own.

# 7. Multivariate Linear Regression

And now we're going to do something about that!

We're now going to use not just one independent variable... but all 7 of them!

The good news is that it follows the same form as the simple linear regression.  This time we string along all of our independent variables within our formula prototype.

Our formula now has multiple independent variables but still follows the same style of solution...

$\widehat{y}(\mathbf{x}) = {\alpha_0}+{\alpha_1} x_1 + {\alpha_2} x_2 + {\alpha_3} x_3  + ... +{\alpha_n} x_n$ 



```{r}

linear_model.S_v_all <- lm(data    = concrete,                             # your data frame
                           formula = Compressive_Strength_28dy ~ Cement +  # your formula
                                                                 Slag +
                                                                 Fly_Ash +
                                                                 Water +
                                                                 Superplasticizer +
                                                                 Fine_Aggregates +
                                                                 Coarse_Aggregates)  


```

And here are these results... 

```{r}

summary(object = linear_model.S_v_all)

```

Our regression coefficients are still here under the "Estimate" column, as are our Standard Error of the Estimate and our Coefficient of Determination.

Also, we can now take a good look at those asterisks at the end of each line of parameter coefficients.  These flag which independent variables are statistically significant in our regression.  The more asterisks, the smaller that coefficient's p-value and the more confident we can be that the independent variable matters to the larger multivariate regression.  Here, we can see that Cement and Water are doing most of the "work" in fitting our suite of independent variables to our dependent variable of Compressive Strength.

Finally, there is the p-value at the bottom of the summary (reported alongside the F-statistic).  The smaller it is, the more confidently we can say that the relationship that we've made with our regression represents our dependent variable.
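Both kinds of p-value can be pulled out of the summary if you want them as numbers.  The sketch below uses a made-up two-ingredient data frame (`toy_concrete` is a placeholder with invented values); the overall p-value is rebuilt from the F-statistic with pf().

```{r}
# made-up mixes with two ingredients (hypothetical values, not the Yeh data)
toy_concrete <- data.frame(
  Cement = seq(from = 150, to = 375, by = 25),
  Water  = c(200, 180, 210, 190, 205, 185, 195, 215, 175, 220)
)
toy_concrete$Strength <- 20 + 0.05 * toy_concrete$Cement -
  0.04 * toy_concrete$Water + rep(c(1, -1), times = 5)

toy_fit <- lm(formula = Strength ~ Cement + Water, data = toy_concrete)

# one p-value per coefficient (these drive the asterisks)
coefficient_p <- summary(toy_fit)$coefficients[, "Pr(>|t|)"]

# the overall p-value comes from the F-statistic and its degrees of freedom
f_stat  <- summary(toy_fit)$fstatistic
model_p <- pf(q          = f_stat[["value"]],
              df1        = f_stat[["numdf"]],
              df2        = f_stat[["dendf"]],
              lower.tail = FALSE)

coefficient_p
model_p
```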

Now... on to looking at our results.

Here is where viewing the results of the regression is tricky.

We have 7 independent variables, but we'd like to see the impact of the fit of all 7 variables on our strength.

When I do this I like to plot the true y value against my regression y(x1,x2,x3,..).

So to do this I will take the fitted values of y and plot them against the original values of y.

Getting the fitted values is easy.  

I'm using the get_regression_points() function from the moderndive package, which adds the modeled "y-hat" value to a dataframe of all of the other values.  (Base R's [fitted()](https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/fitted) function returns the fitted values on their own.)

The fitted version is the dependent variable with a "_hat" at the end.


```{r}

fitted.S_v_all = get_regression_points(model = linear_model.S_v_all)

print(fitted.S_v_all)

```
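If you don't have that helper handy, a similar table can be put together with base R's fitted() and residuals().  This is only a sketch on made-up data (`toy_concrete` and the `Strength_hat` column name are placeholders), not a drop-in replacement for the chunk above.

```{r}
# made-up mixes (hypothetical values)
toy_concrete <- data.frame(Cement = seq(from = 150, to = 375, by = 25))
toy_concrete$Strength <- 20 + 0.05 * toy_concrete$Cement + rep(c(1, -1), times = 5)

toy_fit <- lm(formula = Strength ~ Cement, data = toy_concrete)

toy_points <- data.frame(
  Strength     = toy_concrete$Strength,  # observed values
  Strength_hat = fitted(toy_fit),        # modeled "y-hat" values
  residual     = residuals(toy_fit)      # observed minus modeled
)

head(toy_points)
mean(toy_points$residual)  # effectively zero for a least-squares fit
```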


And finally we can plot our actual vs modeled values.  (I'm adding a trend line)


```{r}


# Making a simple X-Y scatterplot and adding a regression to it

ggplot(data = fitted.S_v_all) +           # invoke graphics environment using a given dataframe
  
  theme_bw( ) +                           # changing the plotting theme
  
  aes(x      = Compressive_Strength_28dy,    # x-value
      y      = Compressive_Strength_28dy_hat) +  # y-value

  ggtitle("Yeh Superplasticizer Tests",
          subtitle = "28-dy Compressive Strength (MPa)") +    # EDITED: Custom Title now with a subtitle
  
  ylab("Modelled")     + # y-label
  xlab("Observed")     + # x-label

  geom_point(colour="darkred") +  # plot points
  
  geom_smooth(method    = "lm",      # use a simple linear model
              formula   = y ~ x,     # lm-style formula
              se        = TRUE,      # display Confidence Intervals
              level     = 0.95,      # Confidence Level to Map Out
              colour    = "red",     # regression line color
              fill      = "magenta", # fill for confidence limits
              linewidth = 0.5)  +    # line thickness (`size` before ggplot2 3.4.0)
  
  geom_abline(slope     = 1,       # NEW: add a very simple line
              intercept = 0,       #  (for a 1:1 reference)
              color     = "grey",
              linetype  = "dashed") +

  coord_fixed(ratio = 1)           # NEW: fix the aspect ratio at 1:1 
                                   #   (I like my plots square)
```

And here we have a nice plot showing our true vs predicted values.

# 8. Regression Quality Metrics

And to close things off, we can do some general error metrics that may be useful..

First, the Mean Error (ME) or Bias... (if we are too high or too low).  Note that nothing is squared here, so this is not the Mean Squared Error.

$BIAS = ME = \frac{1}{N}  \sum_{i=1}^{N} [\widehat{y}(\overrightarrow{x_i})-y_i]  =    \overline{[\widehat{y}(\overrightarrow{x_i})-y_i]}$

```{r}
  # Calculate Bias (Mean Error)

  bias = mean(fitted.S_v_all$Compressive_Strength_28dy_hat - 
                 fitted.S_v_all$Compressive_Strength_28dy)
  
  print(str_c(" Mean Error (ME) or Bias: ", bias))
```
For a linear or multivariate regression the average of our residuals (the difference between each observation and prediction) *should* be zero.

The root mean squared error (RMSE) is shown here.  It shouldn't be zero, since the residuals are squared before summing them up.  We technically should use the standard error of the estimate, but RMSE remains a common error metric.   We can always do both.  The standard error of the estimate takes into account the degrees of freedom, which now include all of the independent variables (p).  We can get the standard error of the estimate from our model summary (it is the "Residual standard error" reported there).

$RMSE = \sqrt{ \frac{1}{N}  \sum_{i=1}^{N} [\widehat{y}(\overrightarrow{x_i})-y_i]^2 } = \sqrt{\overline{[\widehat{y}(\overrightarrow{x_i})-y_i]^2}     }$

$s_{e}$ or $s_{y/x} = \sqrt{ \frac{1}{N-p-1}  \sum_{i=1}^{N} [\widehat{y}(\overrightarrow{x_i})-y_i]^2 }$


```{r}
  # Calculate RMSE

  rmse = sqrt(mean( (fitted.S_v_all$Compressive_Strength_28dy_hat -
                       fitted.S_v_all$Compressive_Strength_28dy)^2)  )
  
  print(str_c("     Root Mean Squared Error (RMSE): ",  
              rmse))
  print(str_c("Standard Error of the Estimate (se): ", 
              summary(linear_model.S_v_all)$sigma))  # you have to dig for this one!
```
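To see where that "dig for it" number comes from, here is a small sketch that rebuilds the standard error of the estimate from the residuals and checks it against the summary.  The data are made up (`toy_concrete` is a placeholder), with one independent variable so p = 1.

```{r}
# made-up mixes (hypothetical values)
toy_concrete <- data.frame(Cement = seq(from = 150, to = 375, by = 25))
toy_concrete$Strength <- 20 + 0.05 * toy_concrete$Cement + rep(c(1, -1), times = 5)

toy_fit       <- lm(formula = Strength ~ Cement, data = toy_concrete)
toy_residuals <- residuals(toy_fit)

N <- nrow(toy_concrete)  # number of observations
p <- 1                   # number of independent variables

toy_rmse <- sqrt(mean(toy_residuals^2))               # divides by N
toy_se   <- sqrt(sum(toy_residuals^2) / (N - p - 1))  # divides by N - p - 1

toy_rmse
toy_se
all.equal(toy_se, summary(toy_fit)$sigma)  # matches the Residual Standard Error
```

Because it divides by a smaller number, the standard error of the estimate always comes out a little larger than the RMSE.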

And finally our correlation coefficient (which is basically our coefficient of determination before the "R" is "squared")

```{r}
  # Get The Unadjusted Correlation Coefficient

  r = cor(x = fitted.S_v_all$Compressive_Strength_28dy,     # the x-value 
          y = fitted.S_v_all$Compressive_Strength_28dy_hat, # the y-value
          method = "pearson"                                # method of correlation
          )
  
  print(str_c("                        correlation coefficient (r): ", r))
  print(str_c("                  coefficient of determination (r²): ", r^2, 
                                                                 " ", 
                                 summary(linear_model.S_v_all)$r.squared))
  print(str_c("adjusted coefficient of determination (Adjusted r²): ", 
               summary(linear_model.S_v_all)$adj.r.squared))


```
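The adjusted version can be rebuilt by hand too.  The sketch below uses the standard textbook adjustment, $1 - (1 - r^2)\frac{N-1}{N-p-1}$, on made-up data (`toy_concrete` is a placeholder with one independent variable) and compares it against what summary() reports.

```{r}
# made-up mixes (hypothetical values)
toy_concrete <- data.frame(Cement = seq(from = 150, to = 375, by = 25))
toy_concrete$Strength <- 20 + 0.05 * toy_concrete$Cement + rep(c(1, -1), times = 5)

toy_fit <- lm(formula = Strength ~ Cement, data = toy_concrete)

N <- nrow(toy_concrete)  # number of observations
p <- 1                   # number of independent variables

toy_r      <- cor(x = toy_concrete$Strength, y = fitted(toy_fit), method = "pearson")
toy_r2     <- toy_r^2
toy_adj_r2 <- 1 - (1 - toy_r2) * (N - 1) / (N - p - 1)

all.equal(toy_r2,     summary(toy_fit)$r.squared)      # unadjusted
all.equal(toy_adj_r2, summary(toy_fit)$adj.r.squared)  # adjusted
```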


And with that, we're done... Once again, this exercise demonstrates a lot of tricks just to show how you can use R for various statistics.  You may not use all of them in your encounters with R for linear or multivariate regression or even at all, but you may be able to cannibalize some of the tricks here for other applications.


